Preface



Many challenging problems facing information systems engineering involve 
the manipulation of complex metadata artifacts, or models, such as database
schemas, interface specifications, or object diagrams, and mappings between 
models. The applications that solve metadata manipulation problems are 
complex and hard to build. The goal of generic model management is to 
reduce the amount of programming needed to develop such applications by 
providing a database infrastructure in which a set of high-level algebraic 
operators, such as Match, Merge, and Compose, are applied to models and 
mappings as a whole rather than to their individual building blocks. 

This dissertation presents an initial study of the concepts and algorithms 
for generic model management. We describe the first prototype of a generic 
model management system, introduce the algebraic operators that are used 
to manipulate models and mappings, clarify the semantics of the operators, 
and develop novel algorithms for implementing them. In particular, we pre- 
sent an innovative algorithm based on fixpoint computation that is used for 
implementing the generic operator Match, which finds correspondences bet- 
ween two models. Using the prototype and the operators presented in the 
dissertation, we develop solutions for several practically relevant problems, 
such as change propagation and reintegration. 




1. Introduction 



“Life is pretty simple: You do some stuff. Most fails. Some works. 
You do more of what works. If it works big, others quickly copy it. 
Then you do something else. The trick is the doing something else.” 

- Leonardo da Vinci (1452-1519) 



This chapter highlights the background of the dissertation and outlines its 
structure. In Sect. 1.1, we introduce metadata management, the general sub- 
ject of this work. The deficiencies of today’s metadata management tech- 
niques are examined in Sect. 1.2. In Sect. 1.3, we sketch the approach to 
metadata management explored in the dissertation, called generic model ma- 
nagement, and formulate our main objectives. An overview of the structure 
and contributions of the dissertation is given in Sect. 1.4. 



1.1 Metadata Management 

Metadata is descriptive information about data and applications. Metadata is 
used to specify how data is represented, stored, and transformed, or may de- 
scribe interfaces and behavior of software components. There are two kinds 
of metadata that are commonly used (Bretherton and Singley 1994). One 
kind of metadata, called structural or control metadata, is deployed prima- 
rily by computer programs. Examples of structural metadata are an interface 
definition in a programming environment or a database schema in a database 
system. The other kind of metadata, called guide metadata, is intended solely 
for use by humans and is expressed in natural language. It contains keyword 
descriptions or documentation, and is often used to facilitate information 
retrieval. The focus of this work is on structural metadata, i.e., schemas, in- 
terface definitions, and other data-structure-like artifacts that directly affect 
database or other computer system operations. 

The first use of structural metadata for data processing was reported in 
(McGee 1959). Since then, metadata-related tasks and applications have be- 
come truly pervasive. They arise in data management, website and portal 

S. Melnik: Generic Model Management, LNCS 2967, pp. 3-11, 2004. 
management, network management, and in various fields of computer-aided
engineering. In data management, the flagship application areas that rely 
heavily on metadata include data integration (Batini et al. 1986), data trans- 
lation (Shu et al. 1977), and database design (Wiederhold 1977). In website 
and portal management, metadata is used to generate entire websites from 
databases (Fernandez et al. 1997; Mecca et al. 1998). In network management, 
explicit models of devices and services are deployed to facilitate control of 
complex networks (Ahn 1994). In software engineering, metadata is used to 
describe the interfaces and behavior of software components (OMG 2002b). 
Feature descriptions of idealized objects such as a point mass or an ideal rope 
are utilized in physics tools (Kook and Novak 1991). In applications related to 
computation and mathematics, metadata is used to describe the properties of 
computer algorithms (Gunther et al. 1997) or discrete optimization problems 
(Blanning 1982; Becker 1996). 

In fact, metadata management plays a major role in today’s informa- 
tion systems. In addition to the aforementioned areas, its importance has 
been emphasized in the context of scientific (Shoshani et al. 1984), statisti- 
cal (McCarthy 1982), geographic (Blott and Vckovski 1995), and biological 
(Davidson et al. 1995b) information systems. The aim of metadata manage- 
ment is to support the design, manipulation, and maintenance of complex 
metadata artifacts such as database schemas, interface definitions, or website 
layouts. 

To illustrate some typical metadata management tasks, consider data
integration, one of the major research topics in database systems (the tasks
mentioned below are highlighted in italics). A key objective of data integra- 
tion is to provide a uniform view covering a number of heterogeneous data 
sources. Using such a view, the data that resides at the sources can be ac- 
cessed in a uniform fashion. This data is usually described using database 
schemas, such as relational, object-oriented, or XML schemas. To construct 
a uniform view, source schemas are matched to identify their similarities and 
discrepancies. The relevant portions of schemas are extracted and integrated 
into a uniform schema. The translation of data from the representation used 
at the sources into the representation conforming to the uniform schema is 
specified using database transformations, which may be expressed in SQL, 
XQuery, XSLT or other data manipulation languages. The queries that are 
stated against the uniform view are transparently rewritten into queries on 
sources. Should the source schemas change, the database transformations and 
the uniform schema may have to be updated accordingly. 

Examining metadata-related tasks of data integration leads to two
observations. First, these tasks are not specific to database schemas and
transformations. Besides database schemas, approaches in the literature have
addressed the integration of ontologies (Mitra et al. 2000; Noy and Musen
2000), knowledge bases (Baral et al. 1991; Subrahmanian 1994), and
specifications of software components (Davies and Woodcock 1996). Second,
the tasks that we listed are not unique to data integration scenarios. Some
of them have been studied in the context of different or specialized
applications, such as website management or data warehousing, and have
acquired distinctive names in the literature, such as schema matching, data
translation, view selection, or change management.

Although the nature of the metadata artifacts manipulated by
metadata-intensive applications often differs, the addressed tasks are
strikingly similar. For example, Roddick et al. (2000) note that many
approaches to change management are remarkably similar even though the
subject of the change may be quite different. They suggest that conceptual
modeling tools need to be developed to support change management. Other
authors argue that the data translation task (Atzeni and Torlone 1996) or
the schema integration task (Barsalou and Gangopadhyay 1992) can be
approached in a uniform fashion for a variety of schema languages.



1.2 The Problem 

Despite the commonalities in the design of metadata-intensive tools and ap- 
plications, little progress has been made in metadata management in the 
past decades. Applications that address metadata manipulation tasks re- 
main complex and hard to build. Several major reasons contribute to their 
complexity: 

— Metadata applications are developed using low-level programming
interfaces. Such interfaces typically provide access to the individual
elements of metadata artifacts, such as individual attribute definitions of
database schemas. Programming metadata applications against such
interfaces requires an extensive amount of navigational code and incurs
high development and maintenance costs.

— Most approaches are application-specific. That is, adapting the code and
infrastructure developed, say, for change management to data integration
requires a major customization effort.

— The solutions are language-specific, i.e., they are developed for SQL,
UML, XML, or RDF and are not easily portable to other domains. For
example, solutions developed for change management of database schemas
are hard to adapt to managing changes of websites.

— No general-purpose platform is available to simplify the development of 
metadata-intensive tools and applications. The existing general-purpose 
solutions typically focus on persistent storage or graphical design environ- 
ments for metadata artifacts and do not go far enough to support the 
developers of metadata applications. In fact, many of today’s metadata- 
related tasks are still solved manually, because an automated approach 
requires too much implementation effort due to the lack of a common pro- 
gramming platform. 

Akin Problems Call for Akin Solutions. To better understand the nature of
the problems that we address, and to set the stage for the approach taken
in the thesis, it is instructive to take a brief look at the state of the art in
data management that prevailed three decades ago (Wiederhold 1977; Date
1995).

In fact, there appears to be a striking similarity between today’s problems
in metadata management and the challenges in data management before the
adoption of the relational model in the 1970s. At that time, data management
applications were developed using an extensive amount of navigational code,
which was hard to write, maintain, and optimize. The same techniques were
reapplied to one new problem after another without getting much leverage
from each succeeding step. The data management code was embedded into
individual applications, which used incompatible storage and access
structures and were not portable between different domains. The existing
database management systems focused on persistent storage of data but
offered little help in programming database applications.

The groundbreaking idea, which eventually revolutionized the database 
research field, was to raise the level of abstraction in developing data-intensive 
applications. In the late 1950s, McGee observed that “there are certain broad
data processing operations which are common to all or most data processing 
applications” and suggested that the key to effective data processing was in 
identifying such generic operations and making them available to application 
developers (McGee 1959, page 6). 

This idea culminated in the pioneering work by Codd (1970). Instead of the
then-common navigational access to individual records and data values, Codd
suggested a set of algebraic operations on entire relations, such as
selection, projection, or join. This approach allowed factoring out many
similar aspects of data management and freeing application code from
ordering, indexing, and access path dependencies. The relational algebra
helped to drastically simplify the programming of data-intensive
applications and laid the foundation for query optimization. In fact, the
relational model and algebra are considered to be “the single most important
development in the entire history of the database field” (Date 1995,
page 22).



1.3 A Vision for Management of Complex Models 

The idea of factoring out common aspects of applications by raising the level
of abstraction worked exceptionally well for data management and is, by
itself, not new. However, applying a similar approach to metadata management
has been suggested only relatively recently. Initial thoughts on a high-level
algebraic approach and three operators for manipulation of knowledge bases,
Intersection, Union, and Difference, were presented in (Wiederhold 1994).
Further operators such as Extract and Match were proposed in (Jannink
et al. 1999) for manipulation of ontologies, dictionaries, and schemas. More
recently, Bernstein et al. (2000b) outlined a vision to provide a truly generic
and powerful environment to enable rapid development of metadata-intensive
applications in different domains. They called this capability generic model
management.

A central concept in generic model management is that of a model. A mo- 
del is a formal description of a metadata artifact. Examples of models include 
database schemas, ontologies, interface definitions, object diagrams, control 
flow diagrams, and form definitions. The manipulation of models usually in- 
volves designing transformations between models. Formal descriptions of such 
transformations are called mappings. Examples of mappings are SQL views, 
XSL transformations, ontology articulations, mappings between class defi- 
nitions and relational schemas, mappings between two versions of a model, 
mappings between device specifications and device functions, etc. 

The key idea behind generic model management is to develop a set of
algebraic operators that generalize the transformation operations utilized
across various metadata applications. These operators are applied to models
and mappings as a whole rather than to their individual elements, and
simplify the programming of metadata applications. The operators are
generic, i.e., they can be utilized for various problems and different kinds
of metadata artifacts. Some of the major model management operators are:

- Match: automatically create a mapping between two models. 

- Merge: merge two models into a third model using a mapping between the 
two models. 

- Extract: return a portion of a model that participates in a mapping. 

- Compose: return the composition of two mappings. 

Model-management operators can be used for solving schema evolution, 
data integration, and other scenarios using short programs, or scripts. For 
example, consider the simple script shown below: 

m1_m2 = Match(m1, m2);

⟨m, m_m1, m_m2⟩ = Merge(m1, m2, m1_m2);

In the first line of the script, the models m1 and m2 are “matched”. The
result of matching is the mapping m1_m2 that describes the correspondences
between m1 and m2. Then, the models are “merged”. The merging is driven
by the mapping produced in the previous step and yields the model m and
the mappings m_m1, m_m2 that describe how m relates to m1 and m2.
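
To make the data flow of this two-line script concrete, the following is a minimal Python sketch under strong simplifying assumptions: a model is reduced to a set of element names, a mapping to a set of element pairs, and Match to case-insensitive name equality. The functions `match` and `merge` and the sample element names are hypothetical illustrations and are not the operators or algorithms of the prototype.

```python
from typing import FrozenSet, Set, Tuple

Model = FrozenSet[str]            # assumption: a model is just a set of element names
Mapping = Set[Tuple[str, str]]    # assumption: a mapping is a set of element pairs


def match(m1: Model, m2: Model) -> Mapping:
    """Toy Match: relate elements whose names agree, ignoring case."""
    return {(a, b) for a in m1 for b in m2 if a.lower() == b.lower()}


def merge(m1: Model, m2: Model, m1_m2: Mapping):
    """Toy Merge: union of both models in which matched elements collapse.

    Assumes every element of m2 is matched to at most one element of m1.
    Returns the merged model m and the mappings m_m1 and m_m2.
    """
    counterpart = {b: a for a, b in m1_m2}   # matched m2 element -> m1 element
    m = frozenset(m1) | frozenset(counterpart.get(b, b) for b in m2)
    m_m1 = {(x, x) for x in m1}
    m_m2 = {(counterpart.get(b, b), b) for b in m2}
    return m, m_m1, m_m2


# Hypothetical sample models.
m1 = frozenset({"Order", "Customer"})
m2 = frozenset({"order", "ShipDate"})

m1_m2 = match(m1, m2)
m, m_m1, m_m2 = merge(m1, m2, m1_m2)
print(sorted(m))   # ['Customer', 'Order', 'ShipDate']
```

The point of the sketch is only the shape of the interface: both operators consume and produce whole models and mappings, so the calling script never navigates individual elements.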

Model-management scripts such as the Match/Merge example above are executed
by a model management system. Although each of the operators can be invoked
by applications individually, the maximal benefit is achieved when an entire
sequence of operations is passed to the model management system as a script
for execution and optimization. A high-level architecture of model management
is depicted in Fig. 1.1. The tools that deploy a model management system may
maintain models and mappings in their own repositories. In this case, models
and mappings that are utilized in a script need to be imported into the model
management system before the script runs. Alternatively, the tools may
exploit the persistence capabilities of the model management system and use
it as a shared repository (Do and Rahm 2000). The tools remain responsible for
the management of model instances, such as data that resides in operational
databases, XML documents, web pages, or device specifications, and may be
capable of executing the mappings, i.e., transforming instances of one model
into instances of another model.




Fig. 1.1. A high-level architecture of model management



If successful, generic model management may improve programmer
productivity for metadata-intensive applications by an order of magnitude.
However, the vision for management of complex models raises many hard
questions. In fact, at a panel that took place at the VLDB 2000 conference
in Cairo, it was debated whether the approach is feasible at all (Bernstein
et al. 2000a). Some of the questions discussed by the panelists were the
following:

— Is it feasible to develop a generic infrastructure for managing models and 
mappings? If so, what would it need to do, beyond what is offered in today’s 
database management systems and repositories? 

— Can we devise a useful generic notion of model that treats all popular in- 
formation structures as specializations (SQL schemas, ER diagrams, XML 
DTDs, object-oriented (OO) schemas, website maps, make scripts, etc.)? 

— Can we produce a generic model manipulation algebra that generalizes 
transformation operations developed for data integration, data translation, 
and data warehousing? 

— Does a generic approach offer any advantages for metadata management 
areas of current interest, such as data integration and XML? 

The questions raised in the panel set the stage for the subject of this
dissertation. One of the conclusions of the panel discussion was that realizing
the vision of generic model management would take years of research and
that substantial implementation effort and theoretical work were required to
answer the above questions to the full extent.

The objective of this first dissertation on generic model management is
to demonstrate that model management operators are implementable and
useful. The problem that we address is very challenging. It took dozens of
Ph.D. theses, hundreds of thousands of lines of code, and many years of work
to demonstrate that the relational model and algebra were implementable and
useful (Stonebraker 2003). We do not expect the investigation of the
practicability of generic model management to be any easier. In fact, an
additional complicating factor is that the formal foundations of generic
model management are much less clear than those underlying the relational
algebra.



1.4 Outline and Contributions of the Dissertation 

The dissertation presents an initial study of the concepts and algorithms for 
generic model management. It consists of four parts. 

Part I. We present the first implemented prototype of a programming
platform for model management, called Rondo¹. The prototype supports the
execution of model management scripts that are written using high-level
operators, which manipulate models and mappings as first-class objects. The
usefulness of the operators is studied in several model-management scenarios,
such as change propagation and reintegration, which involve different kinds
of models and mappings. In prior work, e.g., in (Bernstein et al. 2000b;
Bernstein and Rahm 2000), detailed walkthroughs of various model-management
problems have been examined to address the question of whether metadata
management can be done in a generic fashion. Our contribution is that we
succeeded in making such abstract programs executable.

Primarily, our prototype supports the developers of model-management
solutions by providing a high-level programming environment. However, it
also addresses the needs of the engineers who deploy these solutions by
offering a graphical user interface (GUI) to receive their feedback in
semiautomatic operations. In designing and implementing our prototype, we
consciously focus on simplicity. We investigate how far we can go with a
comparatively weak representation of models and mappings that can be used to
solve an interesting class of problems. We also determine how much code is
needed for a basic, but still useful, model management system.

The conceptual structures and operators used in the prototype are
presented in Chap. 2. The implementation of the prototype, its architecture,
and the algorithms that we developed are addressed in Chap. 3. The results
presented in Part I have been published in (Melnik et al. 2003a; Melnik et al.
2003b).²

1 Rondo is a musical work in which the main theme returns a number of times. We
called our prototype Rondo to reflect the fact that different variations of similar
metadata problems keep arising in numerous applications.

Part II. The operator definitions presented in Part I are largely syntactic: the
models, such as relational and XML schemas, are represented as graphs, and
the semantics of the operators is defined in terms of graph transformations.
We call this semantics structural, since it is driven by the structural properties
of models, i.e., by the relationships between the individual model elements.

And yet, the effect of applying “syntactic” operators to models ultimately 
needs to be expressed in terms of what the operators do to the instances 
of these models, such as entire database states. We call this other kind of 
semantics state-based semantics. Focusing on state-based semantics makes it 
possible to define the properties of operators without relying on a particular 
representation of models. 

In Chap. 4, we define the state-based semantics for models, mappings, 
operators, and scripts. We present detailed examples that illustrate the state- 
based definitions using relational schemas and SQL views. We derive alterna- 
tive formulations of operator definitions that are substantially easier to work 
with. In Chap. 5, we revisit the change propagation scenario presented in 
Part I and argue the correctness of our solution using state-based semantics. 
In Chap. 6, we discuss the state-based semantics of the conceptual structures 
and operators used in our prototype. 

Part III. Although many model-management tasks can be automated, there
remain critical places where human decision-making is needed, e.g., to
address semantic heterogeneity. Thus, some of the operations are inherently
semiautomatic and require feedback from a human engineer before, during, or
after the operator execution. The operator Match, which establishes
correspondences between models, is among the most difficult to automate.

In Chap. 7, we present an algorithm called Similarity Flooding (SF) that
can be used for matching diverse data structures and is utilized for
implementing the operator Match in the prototype. The input models are
represented as directed labeled graphs and are used in an iterative fixpoint
computation whose results tell us which nodes in one graph are similar to
nodes in the second graph. For computing the similarities, we rely on the
intuition that elements of two distinct models are similar when their adjacent
elements are similar. Over a number of iterations, the initial similarity of any
two nodes propagates through the graphs. We demonstrate the applicability
of the algorithm for diverse matching tasks and examine its computational
properties.
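
The intuition of similarity propagating between adjacent elements can be illustrated with a small Python sketch. It is a deliberately simplified fixpoint iteration written for this overview, not the SF algorithm as defined in Chap. 7: the function name `propagate_similarity`, the even splitting of similarity among neighbors, the normalization by the maximum value, and the sample graphs are all assumptions.

```python
from collections import defaultdict
from itertools import product


def propagate_similarity(edges1, edges2, initial, max_iter=100, eps=1e-6):
    """Iterate similarity propagation over node pairs until it stabilizes.

    edges1, edges2: lists of (source, label, target) triples, one per graph.
    initial: dict mapping (node1, node2) pairs to a starting similarity.
    """
    # Two node pairs are neighbors when equally labeled edges connect them
    # in both graphs; similarity flows along these links in both directions.
    neighbors = defaultdict(set)
    for (a, l1, a2), (b, l2, b2) in product(edges1, edges2):
        if l1 == l2:
            neighbors[(a, b)].add((a2, b2))
            neighbors[(a2, b2)].add((a, b))

    pairs = set(neighbors) | set(initial)
    sigma = {p: initial.get(p, 0.0) for p in pairs}
    for _ in range(max_iter):
        nxt = {}
        for p in pairs:
            # Each neighbor q splits its current similarity evenly over its links.
            incoming = sum(sigma[q] / len(neighbors[q]) for q in neighbors[p])
            nxt[p] = initial.get(p, 0.0) + incoming
        top = max(nxt.values(), default=0.0) or 1.0
        nxt = {p: v / top for p, v in nxt.items()}   # normalize to [0, 1]
        delta = max((abs(nxt[p] - sigma[p]) for p in pairs), default=0.0)
        sigma = nxt
        if delta < eps:                              # fixpoint reached
            break
    return sigma


# Hypothetical graphs: one relation and one complex type, each with one member.
g1 = [("ORDERS", "member", "ShipDate")]
g2 = [("PurchaseOrder", "member", "ShipDate")]
sims = propagate_similarity(g1, g2, {("ORDERS", "PurchaseOrder"): 1.0})
for pair, score in sorted(sims.items()):
    print(pair, round(score, 3))
```

In this sketch the pair of member nodes starts with similarity zero and acquires a positive score only because the enclosing nodes were declared similar, which is the propagation effect described above.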

2 Reprinted from (Melnik et al. 2003a) with permission from Elsevier.

Usually, for every element in the matched models, the SF algorithm
delivers a large set of match candidates. Hence, the immediate result of the
fixpoint computation may still be too voluminous for many matching tasks.
In Chap. 8, we examine several filters that can be used for choosing the
best match candidates from the list of ranked matches returned by the SF
algorithm.
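
As a rough illustration of what such a filter does, the sketch below keeps, for every element of the first model, only its highest-ranked candidate above a threshold. This particular filter, the function name `select_best`, the threshold value, and the sample scores are assumptions made for illustration; the filters actually examined in Chap. 8 are not reproduced here.

```python
def select_best(similarities, threshold=0.5):
    """Keep, for each left element, its best-scoring candidate above a threshold.

    similarities: dict mapping (left, right) element pairs to a score in [0, 1].
    Returns the set of selected (left, right) pairs.
    """
    best = {}
    for (left, right), score in similarities.items():
        if score < threshold:
            continue                      # too weak to be a plausible match
        if left not in best or score > best[left][1]:
            best[left] = (right, score)   # new best candidate for this element
    return {(left, right) for left, (right, _) in best.items()}


# Hypothetical ranked candidates.
ranked = {
    ("ShipDate", "ShipDate"): 0.9,
    ("ShipDate", "Date"): 0.6,
    ("Rebate", "Discount"): 0.3,
}
print(select_best(ranked))   # {('ShipDate', 'ShipDate')}
```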

The evaluation and tuning of the SF algorithm are addressed in Chap. 9. We
suggest a novel accuracy metric for evaluating automatic schema matching 
algorithms and evaluate the effectiveness of our algorithm on the basis of a 
user study that we conducted. A summary of the results presented in Part III 
has been published in (Melnik et al. 2002). 

Part IV. The individual aspects of metadata management have been studied 
extensively in the literature. The operator definitions that we give in Part I 
and the formal properties of the operators that we examine in Part II are 
inspired by the established model-management problems and scenarios, such 
as data integration, schema matching, view selection, or view complement. 

In Chap. 10, we review in detail the major related work and show how 
our operator definitions reflect the properties of the approaches suggested 
in the literature. We also sum up our prior work on declarative mediation 
that served as part of the motivation to address metadata management in a 
generic fashion. 

Generic model management is an extremely rich emerging area of research. 
This dissertation presents a first treatment of some fundamental challenging 
issues in this area. In our work, we uncovered a wide spectrum of exciting 
open problems, which are summarized in Chap. 11. 



2. Conceptual Structures and Operators 



“I can’t work without a model.” 

- Vincent Van Gogh (1853-1890) 



In this chapter, we describe the conceptual structures and operators that are 
used in the prototype of a programming platform for model management that 
we developed. The chapter is organized as follows. 

— In Sect. 2.1, we walk through a model-management scenario to motivate 
the conceptual structures and operator definitions that we present. 

— In Sect. 2.2, we introduce conceptual structures used for representing mo- 
dels and mappings. We explore a simple class of mappings between models 
that we call morphisms and suggest a new structure called selector. 

— In Sect. 2.3, we define the structural semantics of the key model- 
management operators on the conceptual structures that we introduce, 
and suggest several new generic operators. 



2.1 Motivating Scenario 

To motivate the operator definitions that we give in this chapter, we use a 
scenario that is illustrated in Fig. 2.1 and exemplifies one of the patterns that 
can be found in many metadata-intensive applications. 

Example 2.1.1. Consider an e-commerce company that needs to supply its
purchase order data to a business partner that does the accounting, invoicing,
or data warehousing. The data is stored in a relational database according
to a relational schema s1. For the purpose of data exchange, both companies
agree to use a common XML schema d1. (The correspondences between the
elements of schemas s1 and d1 are depicted as light gray lines.) Schema d1
differs from s1 in terms of structure and naming.

The relational schema used by the company undergoes periodic changes
due to the dynamic nature of its business. Assume that s2 is a new version
S. Melnik: Generic Model Management, LNCS 2967, pp. 13-28, 2004. 

Fig. 2.1. Scenario illustrating propagation of changes from a relational schema to
an XML schema (panels: original relational schema, modified relational schema,
original XML schema, updated XML schema)



of the relational schema s1, in which columns “Brand” and “Discount” have
been deleted, and columns “ShipDate”, “FreightCh” (freight charge), and
“Rebate” have been added. These changes (highlighted in bold in Fig. 2.1)
need to be propagated to the XML schema, so that d1 becomes d2.

The change propagation described above can be done as follows. First,
the changes introduced by s2 need to be detected, i.e., s1 and s2 need to
be matched. Then, the d1 images of the elements deleted in s1 need to be
removed from d1. Finally, the XML schema counterparts of the added and
renamed columns in s1 need to be merged into d1 to obtain d2. During these
steps, intervention of a human engineer may be required, for example, to
decide whether the new column “Rebate” should indeed be added to the
exchange schema or is not part of the exchanged data and should be omitted.
Still, a major portion of the work is mechanical and can be automated.
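
The mechanical part of the first two steps can be sketched with plain Python sets. This is an illustration under simplifying assumptions, not the prototype's implementation: schemas are reduced to sets of element names, mappings to sets of pairs, and the helpers `domain`, `range_`, and `traverse` are hypothetical. The column “OrderID” and the XML elements “Id” and “Discount” are invented for the example; the remaining names come from the scenario.

```python
def domain(mapping):
    """Left-hand elements that take part in a mapping."""
    return {a for a, _ in mapping}


def range_(mapping):
    """Right-hand elements that take part in a mapping."""
    return {b for _, b in mapping}


def traverse(elements, mapping):
    """Images of the given elements under a mapping."""
    return {b for a, b in mapping if a in elements}


s1 = {"OrderID", "Brand", "Discount"}                  # original relational schema
s2 = {"OrderID", "ShipDate", "FreightCh", "Rebate"}    # modified relational schema
d1 = {"Id", "Brand", "Discount"}                       # original XML schema

s1_s2 = {("OrderID", "OrderID")}                       # result of matching s1 and s2
s1_d1 = {("OrderID", "Id"), ("Brand", "Brand"), ("Discount", "Discount")}

deleted = s1 - domain(s1_s2)    # elements of s1 with no counterpart in s2
added = s2 - range_(s1_s2)      # elements of s2 with no counterpart in s1
d1_pruned = d1 - traverse(deleted, s1_d1)   # d1 without images of deleted elements

print(sorted(deleted))     # ['Brand', 'Discount']
print(sorted(added))       # ['FreightCh', 'Rebate', 'ShipDate']
print(sorted(d1_pruned))   # ['Id']
```

Deciding which of the added columns, such as “Rebate”, actually belong in the exchange schema is the step that may still need a human engineer.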

Notice that the procedure sketched above could be applied in the
reverse case, when the XML schema d1 is the one that has been modified and
the changes are to be propagated back to the relational schema s1. Another
instance of the same pattern is round-tripping the modifications from a
relational schema like s1 to an existing conceptual schema of the data, which
may be expressed as an Entity-Relationship (ER) diagram. A key idea of generic
model management is to solve such tasks at a high level of abstraction using
a concise generic script.

Below we present an actual model-management script that implements
the above solution for our change propagation scenario, and is directly
executable by our prototype. We will use the script to introduce the major
model-management operators, which we define in the subsequent sections.
To explain the individual steps of the script, we use a schematic
representation of the solution shown in Fig. 2.2. The rectangles labeled s1,
s2, d1, and d2 represent the four schemas of Fig. 2.1. The arcs between the
rectangles denote the mappings between the schemas. For example, the
correspondences between schemas s1 and d1 in Fig. 2.1 are shown as a single
arc from rectangle s1 to d1 in Fig. 2.2.




Legend:

s1 = original source model
s2 = modified source model
d1 = target model
d1' = d1 without elements deleted by way of s2
c = converted from s2
c' = elements added to s2 (after conversion)
d2 = updated target model



Fig. 2.2. Schematic representation of a solution for change propagation scenario 
of Fig. 2.1 



At the bottom of Fig. 2.2, there is a schema c, which does not appear in
Fig. 2.1. To see why it is needed, recall that s1 and d1 are expressed using
two different schema languages. The new schema elements added to s1 by
way of s2 have no counterparts in schema d1. That is, the new elements need
to be converted from the source schema language to the target language.
For example, the attribute “ShipDate” added to relation “ORDERS” needs
to be converted to a subelement of the complex type “PurchaseOrder” in
the XML schema. This step is often referred to as schema translation in the
literature. In our solution, we assume that such a translation tool is available
as an operator, say SQL2XSD, which takes as input a relational schema and
produces as output an XML schema and a mapping between the original
and converted schema elements. Thus, the schema c and the mapping s2_c
between s2 and c shown in Fig. 2.2 are obtained as ⟨c, s2_c⟩ = SQL2XSD(s2).
Schema c is illustrated in Fig. 2.3. Note that c is not yet the desired result d2;
for example, c contains an unneeded complex type O-DETAILS, and differs
from d2 structurally.

Now, our solution for the change propagation scenario can be expressed 
as the following script: 

Fig. 2.3. Converted schema c and support element ORDERS in c'



operator PropagateChanges(s1, d1, s1_d1, s2, c, s2_c)

1. s1_s2 = Match(s1, s2);

2. ⟨d1', d1_d1'⟩ = Delete(d1, Traverse(All(s1) − Domain(s1_s2), s1_d1));

3. ⟨c', c_c'⟩ = Extract(c, Traverse(All(s2) − Range(s1_s2), s2_c));

4. c'_d1' = Invert(c_c') * Invert(s2_c) * Invert(s1_s2) * s1_d1 * d1_d1';

5. ⟨d2, d2_c', d2_d1'⟩ = Merge(c', d1', c'_d1');

6. s2_d2 = s2_c * c_c' * Invert(d2_c') +

Invert(s1_s2) * s1_d1 * d1_d1' * Invert(d2_d1');

7. return ⟨d2, s2_d2⟩;

The script defines a generic operator PropagateChanges, which takes six 
parameters as input (including the converted schema c), and produces two 
return values (d2, s2_d2) as output. Below, we explain the script line by line. 

1. In line 1, schemas s1 and s2 are “matched” to detect the changes. The 
result is a mapping s1_s2 shown schematically in Fig. 2.2. Speaking 
informally, the mapping connects the equivalent elements of s1 and s2. The 
new elements of s2 (e.g., “ShipDate”) and deleted elements of s1 (e.g., 
“Brand”) have no matching counterparts, so they remain unconnected. 

2. Line 2 illustrates how operators can be combined. First, the deleted 
elements of s1 are identified using the expression All(s1) - Domain(s1_s2), 
i.e., all elements of s1 without the matched (and thus not deleted) 
elements. Then, these elements are used to “traverse” the mapping s1_d1. 
For example, the deleted relational attribute “Brand” traverses s1_d1 and 
yields the XML schema element “Brand” of d1. Finally, these d1 images 
of the deleted elements are removed from d1 using the operator Delete. 
The result is a new schema d1' (a “subschema” of d1), and a mapping 
d1_d1', which describes how d1 relates to d1'. 

3. Line 3 is quite similar to line 2. The new elements of s2, i.e., those missing 
from the range of s1_s2, traverse s2_c into the converted model c. For 
example, the image of relational attribute “ShipDate” is an XML schema 
element “ShipDate” obtained by conversion. A “subschema” c' containing 
the images of the new elements is then extracted from c using the operator 
Extract, which also returns the mapping c_c'. In addition to the elements 
obtained by traversal like “ShipDate”, c' contains an extra element of 
c, the complex type “ORDERS” that encloses “ShipDate”. Such extra 
elements are called support elements (Bernstein 2003). Support elements 
may have to be extracted to make c' a well-formed XML schema. 

4. At this point, d1' is a subschema of d1 without the deleted elements, and c' 
contains the added elements and their support elements. Schemas d1' and 
c' need to be merged to obtain the final result (line 5). As we explain in 
Sect. 2.3.5, the merging of two schemas is driven by a mapping that tells 
how elements of the two schemas, specifically the support elements of c', 
correspond to each other. The mapping between d1' and c' is shown in 
Fig. 2.2 as an arc connecting the two enclosed rectangles. This mapping 
can be obtained by “composing” the existing mappings between c', c, s1, 
s2, d1, and d1' as Invert(c_c') * Invert(s2_c) * Invert(s1_s2) * s1_d1 * d1_d1'. 
To get the composition right, mappings c_c', s2_c, and s1_s2 need to 
be ‘inverted’, i.e., the domains and ranges of the mappings need to be 
swapped. Thus, we determine by composition that the support element 
“ORDERS” in c' corresponds to the element “PurchaseOrder” in d1'. 

5. The final result of change propagation, schema d2, is computed by the 
Merge operator. Among other things, the operator Merge creates a single 
complex type definition from complex type “ORDERS” from c' and 
“PurchaseOrder” from d1'. Additionally, the operator returns two 
mappings, d2_c' and d2_d1', which describe how d2 relates to the inputs to 
Merge, c' and d1'. 

6. As a last step, we compute s2_d2, a new version of the mapping s1_d1 
given as part of the input. We need s2_d2 to ensure that our change 
propagation script can be re-applied if the source schema evolves again. 
Since d2 is obtained by merging d1' and c', the mapping s2_d2 is 
essentially a union of two mappings, the one between s2 and the c'-portion 
of d2, and the one between s2 and the d1'-portion of d2. These two 
mappings can be obtained by composition as s2_c * c_c' * Invert(d2_c') and 
Invert(s1_s2) * s1_d1 * d1_d1' * Invert(d2_d1'), respectively. Their union is 
denoted using the plus sign (+). To illustrate, the first mapping 
establishes the correspondences between the added elements “ShipDate”, 
“FreightCh”, “Rebate” in s2 and their d2 counterparts. The second 
mapping in the union tells us that “OID” in s2 corresponds to “OrderID” in 
d2, etc. 
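
The composition used in line 4 of the script can be replayed on toy data. 
The following is only a minimal sketch: Python is used for illustration (it 
is not the language of the definitions in this chapter), each morphism is 
modeled as a set of (left, right) pairs, and the prefixed node names such as 
"s2.ShipDate" are hypothetical identifiers rather than OIDs taken from the 
figures. 

```python
from functools import reduce

def invert(m):
    """Swap the left and right side of every pair."""
    return {(r, l) for (l, r) in m}

def compose(m1, m2):
    """Join two morphisms on m1.right == m2.left."""
    return {(l, r2) for (l, r1) in m1 for (l2, r2) in m2 if r1 == l2}

# Hypothetical toy morphisms for the scenario of Fig. 2.2.
c_cx = {("c.ORDERS", "c'.ORDERS"), ("c.ShipDate", "c'.ShipDate")}
s2_c = {("s2.ORDERS", "c.ORDERS"), ("s2.ShipDate", "c.ShipDate"),
        ("s2.OID", "c.OID")}
s1_s2 = {("s1.ORDERS", "s2.ORDERS"), ("s1.OID", "s2.OID")}
s1_d1 = {("s1.ORDERS", "d1.PurchaseOrder"), ("s1.OID", "d1.OrderID")}
d1_d1x = {("d1.PurchaseOrder", "d1'.PurchaseOrder"),
          ("d1.OrderID", "d1'.OrderID")}

# Invert(c_c') * Invert(s2_c) * Invert(s1_s2) * s1_d1 * d1_d1'
chain = [invert(c_cx), invert(s2_c), invert(s1_s2), s1_d1, d1_d1x]
cx_d1x = reduce(compose, chain)
print(cx_d1x)
```

On this toy input only the support element survives the chain: the new 
element "ShipDate" has no partner in s1_s2, so its path is cut at the third 
step, while "ORDERS" in c' is connected to "PurchaseOrder" in d1'. 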

Notice that the above script is not limited to propagating changes from 
relational schemas to XML schemas. In fact, the reverse propagation problem 
can be solved using the same script by assigning the original and modified 
XML schemas to s1 and s2, and the relational schema to d1. Of course, the 
input parameters c and s2_c need to be obtained using a different converter, 
e.g., as (c, s2_c) = XSD2SQL(s2). 

In our implementation, every intermediate result of a script such as the 
one above can be examined and adjusted by a human engineer using a 
graphical tool. Specifically, the result of Match in line 1 can be post-processed 
to remove the incorrectly suggested matches and add the missing ones. 
Similarly, the merging step of line 5 is in general a semiautomatic process, which 
may require human feedback. Finally, by adjusting the intermediate results 
of operator compositions in lines 2 and 3 the engineer can decide which 
additions and deletions should not be propagated. 

In the above discussion, we introduced several operators informally. To 
make these operators effective and usable by developers, their semantics needs 
to be specified precisely. Our goal is to make the semantics as “generic” as 
possible, so the operators can serve a broad range of model-management 
tasks. In the next two sections we describe this semantics, first by defining 
the structures on which they operate, and then by describing the operators 
themselves. 



2.2 Conceptual Structures 

Model-management applications deal with a wide range of metadata artifacts, 
which include not only schemas, such as the relational and XML schemas in 
our motivating scenario, but also view definitions, interface specifications, 
etc. We represent the formal descriptions, or models, of these artifacts as 
directed labeled graphs. This graph representation is quite flexible and can 
accommodate virtually any type of model. 

We also introduce two additional structures, called morphisms and 
selectors. Morphisms are binary relationships that establish n : m correspondences 
between the elements of two models (i.e., nodes of two graphs). For example, 
in our motivating scenario morphisms are used for keeping track of the XML 
counterparts of the relational schema elements. Two morphisms, one between 
s1 and d1 and another between s2 and d2, are shown in Fig. 2.1 using light 
gray lines. The third conceptual structure, selector, is a set of elements used 
in models. A major benefit of using selectors is that various operations, in 
particular the set operations, which would typically produce non-well-formed 
models if used on models, can be applied to selectors safely. 

In the following subsections, we define models, morphisms, and selectors 
as abstract graph and set structures. We also describe them in an 
equivalent representation as relations. The latter will make it easier to define the 
semantics of the operators, which follow later. 

We briefly review the conventional metadata terminology that we use 
below. A meta-model can be thought of as a model that describes the structure 
of another model. Typically, it contains the type definitions for the objects 
used in models. For example, the Open Information Model (OIM) (Bernstein 
et al. 1999) defines meta-models for several database schema and 
transformation languages. A meta-meta model is a representation language in which 
models and meta-models are represented. For example, the Unified Modeling 
Language (UML) specification uses an object-oriented meta-meta model 
called MOF (Meta-Object Facility (OMG 2002a)). The meta-meta model of our 
prototype, which we discuss below, is based on directed labeled graphs. All 
models and meta-models can be viewed as instances of the meta-meta model. 

2.2.1 Models 

We represent models as directed labeled graphs. The nodes of such graphs 
denote model elements, such as relations and attributes in relational schemas, 
type definitions in XML schemas, clauses of SQL statements, etc. We assume 
that each element is uniquely identified by an object identifier (OID). A 
directed labeled graph is a set of edges (s, p, o) where s is the source node, p 
is the edge label, and o is the target node¹. The order of the nodes in a graph 
can be captured by an ordinal property on edges. Thus, conceptually a graph 
can be viewed as a relation M with four attributes, M(S: OID, P: OID, O: 
OID ∪ Literal, N: integer), where N is an optional attribute used for ordering 
and S, P, O form a unique key. The node identifiers and edge labels are drawn 
from the set of OIDs, which can be implemented as integers, pointers, URIs, 
etc. The literals include strings, integers, floats, and other data types. The 
type of attribute O is defined as a union type of OIDs and literals. 




Fig. 2.4. Sample model shown as graph and 4-tuples 



Consider the example in Fig. 2.4. It illustrates how a relational table 
PRODUCTS defined in SQL DDL (top left) is represented as a graph (bottom 
left) and as a corresponding set of 4-tuples (on the right). The ovals in the 
graph denote OIDs, and rectangles denote literals. Nodes a1, a2, a3 represent 
the table PRODUCTS and its columns PID and PName, respectively. Node 
a4 represents the primary key constraint on PID. For readability, the 
identifiers such as Table or Column are spelled out as names rather than opaque 
IDs. 

1 The notation (s,p,o) stands for (subject, predicate, object). 
The order of the columns identified by the nodes a2 and a3 is determined 
by the values 1 and 2 of attribute N (fourth attribute of the table with 
4-tuples). In general, the node ordering for a given {src node} and {edge label} 
is determined by the SQL query: SELECT M.O FROM M WHERE M.S={src 
node} AND M.P={edge label} ORDER BY M.N. In the example, we have 
M.S=a1 AND M.P=column. 
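
The ordering query can be tried out directly on the relation M. The sketch 
below is a minimal illustration using Python's built-in sqlite3 module, not 
the prototype's storage layer. The tuples approximate Fig. 2.4: the labels 
name, SQLtype, and column and the node names come from the text, whereas 
the edge label key, the type of PName, and the choice to store OIDs and 
literals alike as text are assumptions made for this example. 

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# M(S, P, O, N): N is the optional ordinal; (S, P, O) is the unique key.
conn.execute(
    "CREATE TABLE M (S TEXT, P TEXT, O TEXT, N INTEGER, UNIQUE (S, P, O))"
)
rows = [
    ("a1", "type", "Table", None),
    ("a1", "name", "PRODUCTS", None),
    ("a1", "column", "a3", 2),       # inserted out of order on purpose
    ("a1", "column", "a2", 1),
    ("a2", "type", "Column", None),
    ("a2", "name", "PID", None),
    ("a2", "SQLtype", "int", None),
    ("a3", "type", "Column", None),
    ("a3", "name", "PName", None),
    ("a3", "SQLtype", "varchar", None),
    ("a4", "type", "PrimaryKey", None),
    ("a4", "key", "a2", None),       # hypothetical edge label
]
conn.executemany("INSERT INTO M VALUES (?, ?, ?, ?)", rows)

def ordered_targets(src, label):
    """Target nodes of all `label` edges leaving `src`, ordered by N."""
    cur = conn.execute(
        "SELECT M.O FROM M WHERE M.S = ? AND M.P = ? ORDER BY M.N",
        (src, label),
    )
    return [o for (o,) in cur]

columns = ordered_targets("a1", "column")
print(columns)
```

The insertion order of the two column edges is deliberately reversed, so 
that the result depends only on the ordinal attribute N. 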

A formal specification of the rules for encoding a model as a graph is called 
a meta-model. A model is well-formed if it conforms to its meta-model. For 
example, Fig. 2.4 illustrates a graph encoding of relational schemas that uses 
specific edge labels, such as SQLtype or name, and auxiliary nodes, such as 
Table, varchar, or PrimaryKey. If we know the relational meta-model, we can 
tell whether or not a given graph represents a well-formed relational schema. 
For example, if we know that each column must have an SQL type, then 
removing the edge (a2, SQLtype, int) from the graph in Fig. 2.4 yields a model 
that is not well-formed. For the purposes of this chapter, it is unimportant 
how a meta-model is represented and how one checks that a model conforms 
to its meta-model. The details of the graph representation of models remain 
opaque to the developer of model management applications. Of course, the 
representation is visible to developers of model management operators. So, 
a developer must be aware of the representation to implement a custom, 
non-generic operator, e.g., an operator to normalize relational schemas. 
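
As an illustration of a single meta-model rule, the sketch below checks the 
constraint mentioned above, namely that each column must have an SQL type. 
It is a minimal Python sketch under the assumption that a model is held as 
a set of 4-tuples (S, P, O, N); the function name and the sample tuples are 
hypothetical, and a real meta-model would contain many more rules. 

```python
def columns_without_type(model):
    """Return the column nodes that lack an SQLtype edge."""
    columns = {s for (s, p, o, n) in model if p == "type" and o == "Column"}
    typed = {s for (s, p, o, n) in model if p == "SQLtype"}
    return columns - typed

# Fragment of the PRODUCTS model (tuples are an approximation of Fig. 2.4).
model = {
    ("a1", "type", "Table", None),
    ("a1", "column", "a2", 1),
    ("a1", "column", "a3", 2),
    ("a2", "type", "Column", None),
    ("a2", "SQLtype", "int", None),
    ("a3", "type", "Column", None),
    ("a3", "SQLtype", "varchar", None),
}

# Removing the edge (a2, SQLtype, int) violates the rule for node a2.
broken = model - {("a2", "SQLtype", "int", None)}

print(columns_without_type(model))
print(columns_without_type(broken))
```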

2.2.2 Morphisms 

Many metadata-intensive applications, such as data integration and 
warehousing tools, use a graphical metaphor like the one shown in Fig. 2.1 for 
representing schema mappings. These mappings are shown to the engineer as 
sets of lines connecting the elements of two schemas. We call such mappings 
(schema) morphisms. Thus, a morphism is a binary relation over two 
(possibly overlapping) sets of OIDs, i.e., a set of pairs (l, r) drawn from OID × OID. 

Clearly, a morphism is a weaker representation of a transformation 
between two models than an SQL view or the mapping languages and expressions 
suggested in (Bergamaschi et al. 1999; Bernstein et al. 2000b; Davidson et al. 
1995a; Miller et al. 1994; Mitra et al. 2000). In particular, a morphism 
carries no semantics about the transformation of instances that conform to the 
models (e.g., no SQL WHERE-clause). Still, we have found that many 
mappings can be expressed in this way, such as in our change propagation scenario 
of Sect. 2.1. The morphisms have several other advantages. Given our graph 
representation of models, a morphism can represent a mapping between 
different kinds of models, e.g., between a relational and XML schema. A morphism 
can always be inverted and composed. (In contrast, an SQL view cannot be 
composed with an XSL transformation in an obvious way.) And since 
morphisms can be expressed as binary relations, they can be implemented and 
manipulated easily. 
Fig. 2.5. A morphism between a relational and an XML schema 



Consider the example in Fig. 2.5. The top part of the figure shows the 
relational schema of Fig. 2.4 and an XML schema. A morphism between the 
two schemas is depicted graphically as four arcs that connect the elements 
of the schemas. The bottom part of the figure shows the same morphism 
represented as a relation. The node identifiers a1, a2, a3 correspond to those 
of Fig. 2.4. The nodes b2, b3, b4, b5 denote respectively the complex type 
“Product” and the elements “ProductID”, “ProductName”, and 
“ProductType” defined in the XML schema (its graph representation is illustrated in 
Fig. 2.6). Notice that a node can be connected to multiple nodes; e.g., a3 
(“PName”) is connected to b4 (“ProductName”) and b5 (“ProductType”). 
Moreover, various kinds of model elements, such as relations or attributes, 
can participate in a morphism. 
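
The relational view of this morphism is small enough to write down. The 
following minimal Python sketch models it as a set of (l, r) pairs. Only the 
two arcs leaving a3 are stated explicitly in the text; the pairs (a1, b2) 
and (a2, b3) are an assumption about the remaining two arcs of Fig. 2.5, 
and the helper name images is hypothetical. 

```python
# Morphism H(L, R) between the relational schema and the XML schema.
morphism = {
    ("a1", "b2"),  # PRODUCTS -> Product      (assumed arc)
    ("a2", "b3"),  # PID      -> ProductID    (assumed arc)
    ("a3", "b4"),  # PName    -> ProductName
    ("a3", "b5"),  # PName    -> ProductType
}

def images(h, node):
    """All right-hand nodes that `node` is connected to."""
    return {r for (l, r) in h if l == node}

print(images(morphism, "a3"))
```

Because the structure is a plain binary relation, the n : m case (a3 is 
connected to two nodes) needs no special treatment. 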




Fig. 2.6. Graph representation of XML 
schema in Fig. 2.5 



In an implementation, it may be convenient to annotate the pairs (l, 
r) with additional properties. For example, most implementations of the 
Match operator compute similarity values between the elements of two 
models. These values can be returned conveniently using a morphism in which 
each pair has an additional similarity property. Hence, although we define a 
morphism conceptually as a binary relation H(L: OID, R: OID), it may 
contain additional attributes, as required by the individual operators. Typically, 
the L elements originate from one model, and the R elements from another. 
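
An annotated morphism of this kind might look as follows. This is a hedged 
sketch only: the attribute name sim, the similarity values, and the 
threshold-based filter are invented for illustration and do not describe 
how the Match operator of the prototype selects its results. 

```python
from typing import NamedTuple

class Pair(NamedTuple):
    l: str      # element of the left model
    r: str      # element of the right model
    sim: float  # additional similarity attribute (hypothetical values)

candidates = [
    Pair("a2", "b3", 0.9),
    Pair("a3", "b4", 0.8),
    Pair("a3", "b5", 0.4),
    Pair("a2", "b5", 0.1),
]

def to_morphism(pairs, threshold):
    """Project the annotated pairs onto a plain binary relation H(L, R)."""
    return {(p.l, p.r) for p in pairs if p.sim >= threshold}

print(to_morphism(candidates, 0.5))
```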
2.2.3 Selectors 

A selector is a set of node identifiers, which may originate from a single or 
multiple models. It can be represented as a relation with a single attribute, 
S(V: OID), where V is a unique key. Fig. 2.7 shows an example of a selector 
that contains all OIDs used in the model depicted in Fig. 2.4. 



V 
---------- 
a1 
a2 
a3 
a4 
Table 
Column 
PrimaryKey 
int 
varchar 



Fig. 2.7. Example of a selector 
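
Since a selector is just a set of OIDs, set operations on selectors always 
yield selectors again. The short Python sketch below illustrates this with 
the selector of Fig. 2.7; the variable names are hypothetical. 

```python
# The selector of Fig. 2.7: all OIDs used in the model of Fig. 2.4.
fig_2_7 = frozenset(
    {"a1", "a2", "a3", "a4", "Table", "Column", "PrimaryKey", "int", "varchar"}
)
# The nodes that stand for model elements (table, columns, constraint).
model_elements = frozenset({"a1", "a2", "a3", "a4"})

# Difference and intersection are again plain sets of OIDs.
auxiliary = fig_2_7 - model_elements
shared = fig_2_7 & model_elements

print(sorted(auxiliary))
```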



2.3 Operators 

In our motivating scenario, we introduced several high-level operators whose 
inputs and outputs are models, morphisms and selectors, such as Match, 
Delete, Traverse, Extract, and Invert. Such operators raise the level of 
abstraction of manipulating metadata structures by considering whole models 
and morphisms at a time, as opposed to using node-at-a-time primitives. 
For easy reference, the signatures and informal descriptions of the operators 
that are used in scripts most frequently are summarized in Table 2.1. In this 
section, we define the precise semantics of these operators on the structures 
defined in Sect. 2.2. We call this semantics structural. The implementation of 
the operators is covered in Chap. 3. 

We start our presentation of operator semantics in Sect. 2.3.1 with what 
we call primitive operators. These are generic operators whose semantics 
can be defined formally using the relational algebraic manipulation of the 
relational representations of Sect. 2.2. For notational convenience, we express 
this manipulation in SQL. After that, we introduce the other, more powerful 
operators, such as Extract, Delete, Match, and Merge, whose semantics is 
more subtle and still a subject of ongoing research². 

As we will see, some operators, such as Subgraph or Copy, are agnostic 
about the kind of models passed as input, whereas the semantics of others 
depends on the underlying meta-model. The GUI operators EditMap and 
EditSelector allow arbitrary transformations of morphisms and selectors by 
an engineer. Thus, their semantics cannot be constrained any further. 

2 In (Melnik et al. 2003a; Melnik et al. 2003b), we used slightly different operator 
signatures for Extract, Merge, and Diff. In this dissertation, we changed the 
directionality of the output mappings of these operators to facilitate a more 
natural notation when the mappings are functional. This difference is, however, 
purely syntactic. 
Table 2.1. Summary of key operators in Rondo 

Signature                                Description 
---------------------------------------  ----------------------------------- 
Primitive operators 

s = Domain(map)                          Returns selector s that holds the 
                                         elements in the domain of 
                                         morphism map 

map2 = Invert(map1)                      Swaps the “left” and “right” side 
                                         of the input morphism map1 

map2 = RestrictDomain(map1, s)           Restricts the domain of morphism 
                                         map1 to the elements in selector s 

s = All(m)                               Returns a selector s containing all 
                                         elements of model m 

map = Id(s)                              Returns the identity morphism map 
                                         for selector s 

m1_m3 = Compose(m1_m2, m2_m3)            Composes morphisms m1_m2 and 
      = m1_m2 * m2_m3                    m2_m3 

Derived operators 

s = Range(map)                           Returns selector s that holds the 
                                         elements in the range of 
                                         morphism map 

map2 = RestrictRange(map1, s)            Restricts the range of morphism 
                                         map1 to the elements in selector s 

s2 = Traverse(s1, map)                   Returns selector s2 holding the 
                                         elements in the range of morphism 
                                         map that are reachable by 
                                         traversing map from s1 

(md, m_md) = Delete(m, s)                Returns a “submodel” md of m that 
                                         does not contain the elements in 
                                         selector s 

More complex operators 

(mx, m_mx) = Extract(m, s)               Extracts a “submodel” mx of m that 
                                         contains the elements in selector s 

m1_m2 = Match(m1, m2 [, seed])           Computes a morphism m1_m2 between 
                                         m1 and m2 using an optional 
                                         initial morphism seed 

(m, m_m1, m_m2) =                        Merges models m1 and m2 using 
  Merge(m1, m2, m1_m2)                   morphism m1_m2 



2.3.1 Primitive Operators 

Table 2.2 lists the definitions of seven primitive operators. The left column 
contains the operator definitions expressed in SQL. Variables m, s, and map 
hold a model, a selector, and a morphism, respectively. The right column 
illustrates the application of the operators using simple examples. All primitive 
operators defined in the table are standard set-theoretic operators. Notice 
that their definitions are expressed declaratively, i.e., the implementation of 
these operators, or functional combinations thereof, can be optimized using 
standard query optimization techniques. 
Table 2.2. Definitions of primitive operators 

Definition                           Example 
-----------------------------------  --------------------------------------- 
Domain(map) :=                       Domain({(a1, b1), (a2, b2)}) 
  SELECT DISTINCT map.L AS V           = {a1, a2} 
  FROM map 

RestrictDomain(map, s) :=            RestrictDomain({(a1, b1), (a2, b2)}, 
  SELECT * FROM map                                 {a1}) 
  WHERE map.L IN s                     = {(a1, b1)} 

Invert(map) :=                       Invert({(a1, b1), (a2, b2)}) 
  SELECT map.R AS L,                   = {(b1, a1), (b2, a2)} 
         map.L AS R 
  FROM map 

Compose(map1, map2) :=               Compose({(a1, b1), (a2, b2)}, 
  SELECT DISTINCT map1.L,                    {(b1, c1)}) 
         map2.R                        = {(a1, c1)} 
  FROM map1, map2 
  WHERE map1.R = map2.L 

TransitiveClosure(map) :=            TransitiveClosure({(a, b), (b, c)}) 
  WITH RECURSIVE TC(L, R) AS           = {(a, b), (b, c), (a, c)} 
  (map UNION 
   SELECT DISTINCT TC.L, 
          map.R 
   FROM TC, map 
   WHERE TC.R = map.L) 
  SELECT * FROM TC 

Id(s) :=                             Id({a1, a2}) 
  SELECT s.V AS L, s.V AS R            = {(a1, a1), (a2, a2)} 
  FROM s 

Subgraph(m, s) :=                    Subgraph(M, s), where 
  SELECT *                             M = model of Fig. 2.4; the result 
  FROM m                               includes the edge 
  WHERE m.S IN s AND                   (a2, name, “PID”) 
    (m.O IN s OR isLiteral(m.O)) 



The operator Domain extracts the “left” elements from a morphism and 
returns a selector that holds the result. The operator RestrictDomain restricts 
a morphism to a smaller element domain, which is specified by the selector 
passed as a second parameter of the operator. The Invert operator swaps 
the left and right elements of a morphism. The Compose (*) operator is 
defined as the natural join of two morphisms, yielding another morphism. 
The TransitiveClosure operator on morphisms is specified using a recursive 
SQL definition. The Id operator creates an identity morphism over a given 
selector. 
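
The SQL definitions translate almost literally into set comprehensions. The 
sketch below is one possible rendering in Python, with morphisms as sets of 
(l, r) pairs and selectors as sets of OIDs; it is meant as an aid for 
experimenting with the operators, not as a description of the prototype. 
The function names are hypothetical, and the transitive closure is computed 
by naive iteration instead of a recursive query. 

```python
def domain(m):
    """Selector holding the left elements of morphism m."""
    return {l for (l, r) in m}

def restrict_domain(m, s):
    """Keep only the pairs whose left element is in selector s."""
    return {(l, r) for (l, r) in m if l in s}

def invert(m):
    """Swap the left and right side of every pair."""
    return {(r, l) for (l, r) in m}

def compose(m1, m2):
    """Join m1 and m2 on m1.R = m2.L."""
    return {(l, r2) for (l, r1) in m1 for (l2, r2) in m2 if r1 == l2}

def transitive_closure(m):
    """Extend m with composed pairs until nothing new is added."""
    tc = set(m)
    while True:
        new = compose(tc, m) - tc
        if not new:
            return tc
        tc |= new

def identity(s):
    """Identity morphism over selector s."""
    return {(v, v) for v in s}

ab = {("a1", "b1"), ("a2", "b2")}
print(domain(ab))
print(compose(ab, {("b1", "c1")}))
print(transitive_closure({("a", "b"), ("b", "c")}))
```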

The operator Subgraph(m, s) extracts from model m a subgraph induced 
by the nodes referenced in s. The literals attached to the nodes in s are 
also extracted from m. In the example of Table 2.2, the literal “PID” is 
not contained in the input selector s, but the edge (a2, name, “PID”) is 
nevertheless returned as part of the result. The extracted subgraph may not 
be a well-formed model. That is, it may not be fully connected and may not 
conform to its meta-model. 
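
The treatment of literals can be seen in the following minimal Python 
sketch. It assumes that a model is a set of (s, p, o) triples and that 
literals are marked by a small wrapper class Lit, which stands in for the 
isLiteral test of Table 2.2; both the class and the sample triples are 
hypothetical. 

```python
class Lit(str):
    """Marks a string as a literal value rather than an OID."""

def subgraph(m, s):
    """Edges whose source is in s and whose target is in s or a literal."""
    return {
        (src, p, o)
        for (src, p, o) in m
        if src in s and (o in s or isinstance(o, Lit))
    }

m = {
    ("a1", "type", "Table"),
    ("a1", "name", Lit("PRODUCTS")),
    ("a1", "column", "a2"),
    ("a2", "type", "Column"),
    ("a2", "name", Lit("PID")),
    ("a2", "SQLtype", "int"),
}

print(subgraph(m, {"a2"}))
```

With the selector {a2}, the edges to the auxiliary nodes Column and int are 
dropped, so the result is indeed not a well-formed relational schema. 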

The set operators Union (+), Difference (-), and Intersection (∩) are 
another three important primitive operators. We define these on models, 
morphisms, and selectors by the corresponding set operations on their 
representation as relations. For example, 

Union(x, y) := SELECT * FROM x UNION SELECT * FROM y 

Note that applying the set operations to well-formed models may produce 
a model that is not well-formed. 

The last two primitive operators are All and Copy. The operator All(m) 
returns a selector that contains only those nodes of m that denote the model 
elements of the model’s meta-model, such as tables or columns in the 
relational meta-model. For example, for the model of Fig. 2.4 the operator All 
yields the selector {a1, a2, a3, a4} and filters out all auxiliary nodes, such as 
Table or PrimaryKey, that are used in the graph encoding. 

Frequently, it is important to ensure that a given node identifier is used 
in exactly one model. Furthermore, unique node IDs make it possible to refer 
to model elements across model boundaries. For these reasons, we use the 
operator Copy to create a copy of a model m in which the selected node 
IDs are replaced by new, uniquely created IDs. In the following definition of 
Copy, the function uniqueOID() generates a unique OID on each call, and the 
function ifNULL(x, y, z) returns y whenever x is a NULL value, z otherwise. 
If s = All(m), the output morphism m_m' is a bijection between All(m) and 
All(m'). 

Copy(m, s) := 

m_m' = SELECT s.V AS L, uniqueOID() AS R FROM s; 
m' = SELECT ifNULL(T1.R, m.S, T1.R), m.P, 
ifNULL(T2.R, m.O, T2.R) 

FROM m LEFT OUTER JOIN m_m' AS T1 ON m.S = T1.L 
LEFT OUTER JOIN m_m' AS T2 ON m.O = T2.L; 

return (m', m_m') 

Fig. 2.8 illustrates the operator Copy. The operator takes as input the 
model m of Fig. 2.4 and selector {a1, a2, a3, a4} = All(m). As a result of 
copying, a new model has been created (on the right), in which the node 
IDs a1, a2, a3, a4 have been replaced by the generated unique IDs a5, a6, a7, 
a8, respectively. 
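
The renaming performed by Copy can be sketched in a few lines of Python. 
This is an illustrative approximation under stated assumptions: models are 
sets of (s, p, o) triples, literals are left out of the sample model so that 
they cannot collide with OIDs, and the generator passed as fresh_oids plays 
the role of uniqueOID(). The names copy_model and fresh_oids and the edge 
label key are hypothetical. 

```python
import itertools

def copy_model(m, s, fresh_oids):
    """Replace the node IDs in selector s by fresh IDs; return (m', m_m')."""
    renaming = {v: next(fresh_oids) for v in sorted(s)}
    m_copy = {
        (renaming.get(src, src), p, renaming.get(o, o)) for (src, p, o) in m
    }
    return m_copy, set(renaming.items())

m = {
    ("a1", "type", "Table"),
    ("a1", "column", "a2"),
    ("a1", "column", "a3"),
    ("a2", "type", "Column"),
    ("a3", "type", "Column"),
    ("a4", "type", "PrimaryKey"),
    ("a4", "key", "a2"),
}

# Fresh IDs a5, a6, ... as in Fig. 2.8.
fresh = (f"a{i}" for i in itertools.count(5))
m_copy, m_m_copy = copy_model(m, {"a1", "a2", "a3", "a4"}, fresh)
print(sorted(m_m_copy))
```

Auxiliary nodes such as Table or Column are not in the selector and are 
therefore shared between the original model and its copy. 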

2.3.2 Derived Operators 

The derived operators are functional combinations of other operators. For 
example, consider the definitions shown below. 
Fig. 2.8. Examples of copying the model of Fig. 2.4 using selector {a1, a2, a3, a4} 



operator Range(map) 

return Domain(Invert(map)); 

operator RestrictRange(map, s) 

return Invert(RestrictDomain(Invert(map), s)); 

operator Traverse(s, map) 

return Range(RestrictDomain(map, s)); 

operator Restrict(map, m1, m2) 

return RestrictRange(RestrictDomain(map, All(m1)), All(m2)); 

The Range of a morphism is obtained as the domain of an inverted 
morphism, by combining the primitive operators Domain and Invert of Table 2.2. 
Similarly, RestrictRange is specified in terms of the operator RestrictDomain 
by first inverting the input morphism, then applying RestrictDomain, and 
finally inverting the resulting morphism once again. 

The third operator, Traverse, was used in our motivating scenario for 
locating the d1 images of the elements deleted from the relational schema s1. 
To traverse the nodes in the selector over a morphism, the morphism is first 
domain-restricted by the selector, and the range of the restricted morphism 
is returned as output. 

The last operator, Restrict, confines the domain and range of a morphism 
to the elements of two models m1 and m2. Notice that the definitions of the 
derived operators above are expressed declaratively, allowing the 
implementations to be optimized. 
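
The derived operators can be expressed as compositions of functions in the 
same way. The Python sketch below mirrors the definitions of Range, 
RestrictRange, and Traverse on morphisms modeled as sets of pairs; Restrict 
is omitted because it additionally depends on All. The function names and 
the sample node names are hypothetical. 

```python
def domain(m):
    return {l for (l, r) in m}

def invert(m):
    return {(r, l) for (l, r) in m}

def restrict_domain(m, s):
    return {(l, r) for (l, r) in m if l in s}

# Derived operators, written exactly as functional combinations.
def range_(m):
    return domain(invert(m))

def restrict_range(m, s):
    return invert(restrict_domain(invert(m), s))

def traverse(s, m):
    return range_(restrict_domain(m, s))

# Toy version of s1_d1: the deleted attribute Brand and its d1 image.
s1_d1 = {("s1.Brand", "d1.Brand"), ("s1.OID", "d1.OrderID")}
print(traverse({"s1.Brand"}, s1_d1))
```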

2.3.3 Extract and Delete 

Extracting and deleting portions of models are operations that are heavily 
deployed in metadata applications. To perform these operations, we propose 
the generic operators Extract and Delete. The operator Extract is applied as 
follows: (m', m_m') = Extract(m, s). The inputs are a well-formed model m 
and a selector s that identifies the set of nodes to be extracted. The output 
model m' satisfies the following properties: 

i. m' contains all selected nodes, 

ii. m' is a well-formed model, 

iii. m' is an equally or less expressive model than m, i.e., m can represent 
all information of m', and 

iv. m' is a ‘minimal’ model that satisfies (i)-(iii). 

Condition (iii) can be characterized formally in terms of dominance and 
information capacity as suggested in (Hull 1986; Miller et al. 1994). The 
morphism m_m' is an injective function from All(m) to All(m'), i.e., each 
model element of m has at most one counterpart in m'. 

In general, a model may contain implicit information, such as transitive 
relationships between model elements. In such cases, the result of Extract may 
need to make such information explicit. For example, consider a class diagram 
with three classes A, B, C, and two explicit subclass definitions: A is a 
subclass of B, and B is a subclass of C. Due to condition (iii), Extract(m, {A, C}) 
should return a class diagram in which A is defined as a subclass of C. This 
example illustrates that extraction is a rich operation, whose semantics and 
implementation may be non-trivial. 
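
The class diagram example can be reproduced with a few lines of Python. The 
sketch below is a deliberate simplification: it reduces a class diagram to a 
set of (subclass, superclass) pairs, makes the implicit relationships 
explicit by a transitive closure, and then keeps only the edges between 
selected classes. It ignores well-formedness rules, minimality, and the 
output morphism, and the function names are hypothetical. 

```python
def transitive_closure(edges):
    """Add (a, d) whenever (a, b) and (b, d) are present, until stable."""
    closed = set(edges)
    while True:
        new = {
            (a, d) for (a, b) in closed for (c, d) in closed if b == c
        } - closed
        if not new:
            return closed
        closed |= new

def extract_subclass_edges(edges, selected):
    """Subclass edges among the selected classes, implicit ones included."""
    return {
        (sub, sup)
        for (sub, sup) in transitive_closure(edges)
        if sub in selected and sup in selected
    }

diagram = {("A", "B"), ("B", "C")}
print(extract_subclass_edges(diagram, {"A", "C"}))
```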

Conceptually, the semantics of the operator Extract(m, s) can be realized 
using the following algorithm: 

1. Create a “closure” of m, i.e., a model m' in which all implicit information 
of m is represented explicitly. 

2. Assign s' = s, where s' is a temporary selector. 

3. For each x in s', extend s' with elements needed to satisfy conditions (ii) 
and (iii). 

4. Repeat step 3 until a fixpoint is reached, i.e., s' no longer changes. 

5. Extract subgraph t' induced by s' as t' = Subgraph(m', s'). 

6. Obtain a “cover” of t', i.e., a minimal model t that is semantically 
equivalent to t'. 

7. Return Copy(t, All(t)) as result of extraction. Notice that the operator 
Copy (Sect. 2.3.1) returns a model and a mapping. 
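
Steps 2 to 4 of the algorithm amount to a fixpoint iteration over the 
selector. The Python sketch below illustrates only this part. It assumes 
that the elements needed by each node are given by a lookup table requires, 
which is a hypothetical stand-in for the meta-model-specific rules implied 
by conditions (ii) and (iii); the actual rules are not specified here. 

```python
def grow_selector(selector, requires):
    """Extend the selector with required elements until it stops changing."""
    s = set(selector)
    changed = True
    while changed:
        needed = set()
        for x in s:
            needed |= requires.get(x, set())
        changed = not needed <= s   # did this pass find anything new?
        s |= needed
    return s

# Hypothetical rule: an element requires its enclosing element, so that
# ShipDate pulls in the support element ORDERS.
requires = {"ShipDate": {"ORDERS"}, "ORDERS": set()}
print(grow_selector({"ShipDate"}, requires))
```

The loop terminates because the selector only grows and is bounded by the 
finite set of nodes mentioned in the table. 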

Deleting a selected portion of a model can be defined as extraction of the 
unselected portion. Thus, we define 

operator Delete(m, s) 

return Extract(m, All(m) - s); 

Note that the nodes of s that do not represent the model elements of m, 
i.e., are not members of All(m), have no impact on the result of deletion due 
to applying All(m) - s. 

2.3.4 Match 

The purpose of Match is to uncover how two models “correspond” to each 
other. It takes two models as input and returns a morphism between them. 
Match is inherently heuristic. So following the previous literature on Match 
(Rahm and Bernstein 2001), we do not offer a formal definition of what 
constitutes a correct output morphism. In general, matching two schemas 
requires information that is not present in the schemas and cannot be fully 
automated. Hence, a human engineer needs to review and adjust the 
suggestions produced by an automatic procedure, either in a post-processing step 
or iteratively. 

2.3.5 Merge 

To combine two models into one, we utilize the operator Merge, applied as 
(m, m_m1, m_m2) = Merge(m1, m2, map). If the input models m1 and m2 
are well-formed, Merge should produce a well-formed model m that 

i. is at least as expressive as each of the input models, i.e., capable of 
representing the information contained in both models, and 

ii. is “minimal”, i.e., the elements shared between the input models are not 
replicated unnecessarily. 

The third parameter to Merge is a morphism map that describes model 
elements of m1 and m2 that are equivalent and should be “merged” into a 
single model element in m. The output morphisms m_m1 and m_m2 identify 
the counterparts of the elements of m1 and m2 in the merged model m. 
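
To give a feeling for the role of the input morphism, the following Python 
sketch implements a deliberately naive merge. It is not the Merge algorithm 
of the prototype: it assumes that map is one-to-one, treats every edge 
source as a model element, simply renames the matched nodes of m2 to their 
m1 counterparts and unions the edges, and makes no attempt to guarantee 
well-formedness. All names and sample triples are hypothetical. 

```python
def naive_merge(m1, m2, mapping):
    """Union m1 with m2 after renaming matched m2 nodes to m1 nodes."""
    to_m1 = {r: l for (l, r) in mapping}  # m2 node -> equivalent m1 node

    def rename(x):
        return to_m1.get(x, x)

    m = set(m1) | {(rename(s), p, rename(o)) for (s, p, o) in m2}
    m_m1 = {(s, s) for (s, _, _) in m1}          # merged node -> m1 node
    m_m2 = {(rename(s), s) for (s, _, _) in m2}  # merged node -> m2 node
    return m, m_m1, m_m2

m1 = {("x1", "name", "ORDERS"), ("x1", "child", "x2"),
      ("x2", "name", "ShipDate")}
m2 = {("y1", "name", "PurchaseOrder"), ("y1", "child", "y2"),
      ("y2", "name", "OrderID")}

m, m_m1, m_m2 = naive_merge(m1, m2, {("x1", "y1")})
names_of_x1 = {o for (s, p, o) in m if s == "x1" and p == "name"}
print(sorted(names_of_x1))
```

The merged node x1 ends up with two names. Which one should survive is 
exactly the kind of decision that the conceptual definition leaves open. 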

The conceptual definition of Merge given above does not say anything 
about the naming and ordering of model elements. For example, it does not 
prescribe that the attribute names of m1 take precedence over those of m2, 
or the other way around. These details are not considered to be part of the 
semantics of Merge because they inherently involve end-user decision making. 
They are discussed in Sect. 3.2.7. 

In the next chapter, we discuss the implementation of the conceptual 
structures and operators presented above. 
“My way is to seize an image that moment it has formed in my 
mind, to trap it as a bird and to pin it at once to canvas. Afterward 
I start to tame it, to master it. I bring it under control and I develop 
it.” 



- Joan Miro (1893-1983) 



This chapter is devoted to the implementation and deployment aspects of the 
first prototype for model management developed as part of the thesis. The 
chapter is structured as follows: 

— In Sections 3.1 and 3.2, we describe the implementation of the concep- 
tual structures and operators, respectively. In particular, we present new 
algorithms developed for the operators Extract and Merge. 

— In Sect. 3.3, we present our prototype in more detail and demonstrate how 
it can be extended to embrace new kinds of models. 

— In Sections 3.4 and 3.5, we examine the solutions for two further impor- 
tant model-management tasks, view reuse and reintegration, that involve 
manipulations of relational schemas, XML schemas, and SQL views. 

We conclude the chapter and Part I in Sect. 3.6. 



3.1 Conceptual Structures 

In this section we discuss our implementation of the conceptual structures. 
We have found that the relations that were used in Sect. 2.2 as standard 
mathematical representation of graphs actually are a convenient implementation 
structure too. Our graph representation is based on the classical relational 
data model, in which node identifiers are constants that can be shared across 
models. We chose a relational approach instead of an object-oriented one 
(e.g., the one in (Bernstein et al. 2000b)) to simplify the implementation and 
specification of the operators, which can often be done using SQL. Our 
relational graph model is based on the W3C’s Resource Description Framework 
(RDF) (Lassila and Swick 1998; Powers 2003). 

For encoding relational schemas, XML schemas, and SQL views as graphs 
we use the following approach. Our meta-model for relational schemas is 
based on OIM (Bernstein and Bergstraesser 1999). For example, the model 
elements of a relational schema comprise tables, columns, and constraints; a 
table contains an ordered list of columns, each of which has a type; tables 
and columns carry names; the constraints are specialized into primary key, 
unique key, non-null, or referential constraints; a referential constraint refers 
to two columns, one of which is a foreign key and the other is a primary key; 
etc. Our graph representation of XML schemas builds on XML DOM (DOM 
1998). The graph representation of SQL views that we deploy is comparable 
to a parse tree produced by an SQL processor (see Fig. 3.8 in Sect. 3.4). All 
clauses, statements, alias definitions, functional terms, etc. are represented 
as separate nodes. A view graph does not replicate the names of attributes 
and relations used in schemas, but refers directly to the respective nodes in 
the schema graphs. 



3.2 Operators 

Now we turn to the implementation of the operators of Sect. 2.3. The out- 
put of the primitive operators is defined uniquely in Sect. 2.3.1, except for 
the operator All, which is implemented differently for each meta-model. For 
example, for relational schemas the implementation of All is specified as fol- 
lows: 

All(m) := SELECT m.S FROM m WHERE m.P = type AND m.O IN
          {Table, Column, PrimaryKey, UniqueKey,
           NonNull, ReferentialConstraint}
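
To make the query above concrete, the following minimal sketch runs it against a toy triple table using Python's built-in sqlite3 module. The table layout (one row per edge with columns S, P, O), the node identifiers t1, a1, c1, and the function name all_elements are assumptions made for illustration; they are not taken from the prototype.

```python
import sqlite3

# Hypothetical encoding: one row per graph edge (subject S, predicate P, object O).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE m (S TEXT, P TEXT, O TEXT)")
conn.executemany(
    "INSERT INTO m VALUES (?, ?, ?)",
    [
        ("t1", "type", "Table"),
        ("t1", "name", "PRODUCTS"),
        ("a1", "type", "Column"),
        ("a1", "name", "PID"),
        ("c1", "type", "PrimaryKey"),
        ("t1", "column", "a1"),
    ],
)

# Node types that count as model elements of a relational schema.
MODEL_ELEMENT_TYPES = ("Table", "Column", "PrimaryKey", "UniqueKey",
                       "NonNull", "ReferentialConstraint")


def all_elements(connection):
    """Return the identifiers of all nodes typed as relational model elements."""
    placeholders = ", ".join("?" for _ in MODEL_ELEMENT_TYPES)
    rows = connection.execute(
        f"SELECT S FROM m WHERE P = 'type' AND O IN ({placeholders})",
        MODEL_ELEMENT_TYPES,
    )
    return {row[0] for row in rows}


print(sorted(all_elements(conn)))
```

Nodes that only occur as literals or as objects of non-type edges (such as the name PRODUCTS) are not returned, which mirrors the idea that All selects model elements rather than every node of the graph.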

3.2.1 Extract and Delete 

To describe our implementation of the Extract and Delete operators we focus
on the relational schemas. Consider the schema m shown on the left of
Fig. 3.1. The primary key constraints on PID and DID are depicted as
horizontal bars underlining the respective attributes. The referential
constraint is shown as a line connecting PRODUCTS.PID and O-DETAILS.PID.
Assume that in the graph representation of m the three constraints are
denoted by the nodes c1, c2, and c3, respectively. For brevity, we henceforth
refer to the graph nodes representing the attributes of m simply by using
their names.

Fig. 3.1 illustrates six examples of extraction and deletion. The output
morphisms m_m1, ..., m_m6 are omitted in the figure for compactness; they
simply connect the respective elements of the input and output schema that
carry the same name. The first example demonstrates extraction of the
attribute PName, which produces schema m1. Condition (ii) of Sect. 2.3.3
requires that m1 be a well-formed relational schema, i.e., attribute PName
belongs to a relation and has a type specification. Applied to relational
schemas, condition (iii), which requires the output model to be no more
expressive than the input model, makes the extracted schema contain all
constraints present in the original schema that affect the selected model
elements. For example, extracting the attribute PRODUCTS.PID from m causes
the primary key constraint c1 to be extracted as well, yielding the schema
m2. Dropping c1 would violate (iii), since it would allow the attribute PID
to contain duplicates and thus the original schema m could not represent all
information of m2. Analogously, extracting O-DETAILS.PID from m (as schema
m4) needs to preserve the referential constraint c2, which in turn requires
the presence of PRODUCTS.PID and its primary key constraint c3. Condition
(iv) prevents any other attributes from appearing in m4.

[Figure: schema m with the relations PRODUCTS and O-DETAILS, and the six
result schemas: m1 = Extract(m, {PRODUCTS.PName}); m2 = Extract(m,
{PRODUCTS.PID}) and m3, which extracts a constraint node;
m4 = Extract(m, {O-DETAILS.PID}) and m5, which extracts a constraint node;
m6 = Delete(m, ...) with a selector containing PRODUCTS.PID, O-DETAILS.DID,
and constraint nodes.]

Fig. 3.1. Examples of extraction and deletion from a relational schema m

In our prototype, the implementation of operator Extract(m, s) for
relational schemas is based on the conceptual algorithm of Sect. 2.3.3.
Steps 1 (“closure”) and 6 (“cover”) are equality assignments. Step 3 of the
algorithm is implemented as follows:

— If s' contains constraint x, add to s' all attributes that participate in the 
constraint definition. 

— If s' contains attribute x, s' is extended to include 

a. the enclosing relation of x , 

b. the type definition of x , 

c. the referential constraint or non-null constraint for x,

d. the primary key or unique key definition for x, but only when all
attributes participating in the key definition are contained in s'.

In Fig. 3.1, schemas m3 and m5 illustrate the extraction of nodes that
denote constraints. To illustrate case (d), consider a relation P(Name, DOB,
Addr) with a unique key constraint on (Name, DOB). According to the
algorithm, Extract(m, P.Name) yields P(Name). The unique key constraint is
not included since P.DOB is not selected.
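
The extension rules of Step 3 can be sketched in Python as follows. This is only an illustration under several assumptions: the selection is extended repeatedly until it no longer grows, a referential constraint is attached to its foreign key attribute (our reading of rule (c)), and the names Constraint, close_selection, relation_of, type_of, and uk_name_dob, as well as the attribute types, are hypothetical. It does not reproduce the full algorithm of Sect. 2.3.3.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Constraint:
    name: str
    kind: str               # "primary_key", "unique_key", "non_null", "referential"
    attrs: frozenset        # attributes participating in the constraint
    foreign_key: str = ""   # referential constraints only: the referencing attribute


def close_selection(selected, relation_of, type_of, constraints):
    """Extend a selection of attribute and constraint nodes until it is stable."""
    by_name = {c.name: c for c in constraints}
    s = set(selected)
    while True:
        size = len(s)
        # A selected constraint pulls in all attributes it is defined on.
        for x in list(s):
            if x in by_name:
                s |= by_name[x].attrs
        for x in [a for a in s if a in relation_of]:
            s.add(relation_of[x])                      # (a) enclosing relation
            s.add(type_of[x])                          # (b) type definition
            for c in constraints:
                if c.kind == "referential" and x == c.foreign_key:
                    s.add(c.name)                      # (c) referential constraint
                elif c.kind == "non_null" and x in c.attrs:
                    s.add(c.name)                      # (c) non-null constraint
                elif (c.kind in ("primary_key", "unique_key")
                      and x in c.attrs and c.attrs <= s):
                    s.add(c.name)                      # (d) key fully selected
        if len(s) == size:
            return s


# The relation P(Name, DOB, Addr) with a unique key on (Name, DOB).
relation_of = {"P.Name": "P", "P.DOB": "P", "P.Addr": "P"}
type_of = {"P.Name": "varchar", "P.DOB": "date", "P.Addr": "varchar"}
key = Constraint("uk_name_dob", "unique_key", frozenset({"P.Name", "P.DOB"}))

print(sorted(close_selection({"P.Name"}, relation_of, type_of, [key])))
print(sorted(close_selection({"P.Name", "P.DOB"}, relation_of, type_of, [key])))
```

Selecting P.Name alone leaves the unique key out, whereas selecting both key attributes brings it in, which corresponds to case (d).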

Notice that condition (iii) of Extract makes it impossible to delete a
constraint on a relational attribute without deleting the attribute
definition, or to delete the primary key attribute participating in a
referential constraint without deleting its foreign key attribute. For
example, consider schema m6 in Fig. 3.1. Selecting PRODUCTS.PID and the
constraints c1 and c2 is not sufficient for deleting this attribute: the
attribute O-DETAILS.PID, which is a foreign key on PRODUCTS.PID, is not
selected; therefore, dropping PRODUCTS.PID would extend the set of possible
values that O-DETAILS.PID may take beyond those contained in PRODUCTS.PID
and hence violate condition (iii). In Sections 3.2.3 and 3.2.4, we present
more flexible operators ExtractMin, DeleteHard, and DeleteSoft, which allow
such deletions by providing fewer consistency guarantees than Extract and
Delete.

Extraction from XML schemas is implemented analogously to the above 
algorithm. Type references in XML schemas are treated similarly to the re- 
ferential constraints in relational schemas. Currently, derived types are not 
supported. 

3.2.2 Dependencies 

As we observed above, the operators Extract and Delete disallow semantically
questionable transformations on schemas, such as dropping arbitrary
constraints, and are defined for schemas only. In general, deletion on
models, which may or may not be schemas, needs to be done carefully to
ensure that the consistency of the resulting model with respect to its
meta-model is not violated. For example, consider the relation ORDERS shown
at the bottom of Fig. 3.2. If we were to delete just the definition of the
table ORDERS, we would risk getting an inconsistent model, in which fields
like OID do not belong to any table. Or, if we delete the field ORDERS.OID,
we might get a malformed referential constraint for O-DETAILS.OID, whose
target key definition is now missing. To deal with such consistency issues
in a more general way, we exploit the concept of existential dependencies
between model elements.

Figures 3.2 and 3.3 show examples of dependencies that hold between
the elements of a relational schema, and between the elements of an XML
schema. Each of the arcs specifies that the source element of the arc is
existentially dependent on the target element. For example, in the
relational schema of Fig. 3.2, the attribute “UnitPrice” cannot exist
without its type definition (arc from “UnitPrice” to “double”). Similarly,
the primary key constraint in table O-DETAILS is malformed if the
constrained field “DID” is missing. The referential constraint between the
fields O-DETAILS.OID and ORDERS.OID spans two tables, and requires both a
foreign key and a primary key. Analogously, in the XML schema of Fig. 3.3,
the definition of the element “shipTo” depends on the existence of the
complex type “Address” as well as on the enclosing sequence element, etc.

    CREATE TABLE O-DETAILS (
        DID int PRIMARY KEY,
        OID int REFERENCES ORDERS,
        UnitPrice double)

    CREATE TABLE ORDERS (
        OID int PRIMARY KEY,
        ...)

Fig. 3.2. Example of existential dependencies in a relational schema
(dependency arcs not reproduced)



    <xsd:complexType name="PurchaseOrder">
      <sequence>
        <xsd:element name="shipTo" type="Address"/>
        <xsd:element ref="comment" minOccurs="0"/>
      </sequence>
    </xsd:complexType>
    <xsd:element name="comment" type="xsd:string"/>
    <xsd:complexType name="Address">

Fig. 3.3. Example of existential dependencies in an XML schema
(dependency arcs not reproduced)



As illustrated in Figures 3.2 and 3.3, dependencies are binary relations 
over the elements of a single model. Thus, we represent dependencies as intra- 
model morphisms, whose left elements are dependent on the right ones. To 
obtain the dependencies for a given model, we use the operator Dependen- 
cies, which invokes a non-generic implementation to compute the dependency 
morphism for the given model. For each supported model type, one such non- 
generic implementation is provided (one for relational schemas, another one 
for XML schemas, etc.). In our implementation, the operator Dependencies 
uses the arc types defined in the meta-model to determine what arcs are de- 
pendency arcs. For example, the arcs column and SQLtype of Fig. 2.4 are 
marked as dependency arcs in our representation of the meta-model for
relational schemas; the target of an arc of type SQLtype depends on the
source, and the source of an arc of type column depends on its target.

3.2.3 ExtractMin 

A general intuition behind extraction is that we want to obtain a minimal
model that contains the nodes in the selector and all those nodes and edges
that are necessary to make the resulting subgraph a “complete”, well-formed
model. Obviously, such a model has to contain at least those nodes that are
existentially required for the nodes in the selector. This minimalist
subgraph can be obtained using the operator ExtractMin defined below, which
uses an auxiliary derived operator Reachable.

operator ExtractMin(M, selector, dependencies) 

T = Subgraph(M, selector + Reachable(selector, dependencies)); 

return Copy(T, All(T)); 

operator Reachable(selector, map) 

return Range(RestrictDomain(TransitiveClosure(map), selector)); 

The operator ExtractMin takes three parameters as input, a source model 
M, a selector that identifies the elements to be selected, and the dependency 
morphism for M. The operator returns the subgraph of M induced by the 
union of the nodes in the selector and all nodes that are required to satisfy 
the existential dependencies of the selected nodes. These required nodes are 
obtained using the operator Reachable. 

To illustrate how Reachable works, imagine that it is called with
parameters {a,d} as selector and {(a,b),(b,c)} as the dependency morphism of
model M. We get: Reachable({a,d}, {(a,b), (b,c)}) =
Range(RestrictDomain({(a,b), (b,c), (a,c)}, {a,d})) =
Range({(a,b), (a,c)}) = {b,c}. Thus, selecting {a,d} from model M yields
Subgraph(M, {a,d} + {b,c}) = Subgraph(M, {a,b,c,d}). The resulting subgraph
contains by definition all edges between {a,b,c,d} and their incident
literals. Notice that the operator Reachable can be executed by the
optimizer efficiently, without materializing the transitive closure. This
observation is important, since the dependency closures of even
moderately-sized models may contain hundreds of thousands of entries.
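
One way to evaluate Reachable without materializing the transitive closure is a plain graph traversal that starts at the selected nodes. The sketch below illustrates this idea in Python; it is not the prototype's optimizer, and the function names, the triple representation of edges, and the explicit set of literals are assumptions made for the example.

```python
from collections import deque


def reachable(selector, dep):
    """Nodes reachable from the selector over one or more dependency arcs.

    Equivalent to Range(RestrictDomain(TransitiveClosure(dep), selector)),
    but computed by breadth-first search instead of building the closure.
    """
    successors = {}
    for src, dst in dep:
        successors.setdefault(src, set()).add(dst)
    seen = set()
    queue = deque(selector)
    while queue:
        node = queue.popleft()
        for nxt in successors.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def extract_min(edges, selector, dep, literals=frozenset()):
    """Subgraph induced by the selector plus everything it depends on.

    `edges` is a set of (subject, predicate, object) triples; edges to the
    given literals are kept when their subject is kept.
    """
    keep = set(selector) | reachable(selector, dep)
    return {(s, p, o) for (s, p, o) in edges
            if s in keep and (o in keep or o in literals)}


print(sorted(reachable({"a", "d"}, {("a", "b"), ("b", "c")})))
```

The traversal visits each dependency arc at most once, so its cost is linear in the number of arcs that can be reached from the selector.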

As another example, consider selecting a single node denoting the attribute
“UnitPrice” from the model of the relational schema of Fig. 3.2 using
ExtractMin. As shown in the figure, the type definition of “UnitPrice” and
the relation “O-DETAILS” are required for the attribute definition, so that
the operator ExtractMin returns a subgraph of the model that represents the
relational schema

CREATE TABLE O-DETAILS (UnitPrice double) 

Similarly, if a single node denoting the primary key of table ORDERS is 
selected, we get 

CREATE TABLE ORDERS (OID int PRIMARY KEY) 

In this case, the node identifying the table ORDERS is pulled out due to 
the transitive dependency of the primary key on the table definition via the 
attribute definition. 

3.2.4 DeleteHard and DeleteSoft

As noted in Sect. 2.3.3, extracting a selected portion of a model can be
viewed as deletion of the unselected portion. To support a broader range of
model management scenarios, we define two additional variants of deletion,
DeleteHard and DeleteSoft. Both operators remove a portion of a model
referenced by a selector. The intuition behind DeleteHard is that we want
to obtain a maximal consistent submodel without the selected nodes. It is
defined as follows.

operator DeleteHard(M, selector, dep)

toDelete = selector + Reachable(selector, Invert(dep));
toKeep = All(M) - toDelete;

return ExtractMin(M, toKeep, dep);

Essentially, the operator DeleteHard takes All(M) elements of M, sub- 
tracts from this set the elements to be deleted, and applies ExtractMin to 
extract the unselected portion of the model. To take the existential dependen- 
cies into account, DeleteHard extends the selector passed as input to include 
all elements of M that would become “dangling”, i.e. , elements that are exi- 
stentially dependent on the elements to be deleted. Such would-be dangling 
elements are obtained by passing the selector and the inverted dependency 
morphism to the operator Reachable. That is, the dependencies are traversed 
in the reverse direction. 

Consider again the example in Fig. 3.2. Imagine that we apply DeleteHard to
the nodes representing the attribute O-DETAILS.UnitPrice and the table
ORDERS. The elements reachable from these selected elements over the
inverted dependency morphism are the foreign key constraint on
O-DETAILS.OID and all attributes of ORDERS (to see that, the arcs in the
figure need to be traversed in the reverse direction). That is, the
constraint and the table ORDERS with all its attributes will be removed,
and we get the schema

CREATE TABLE O-DETAILS (DID int PRIMARY KEY, OID int) 

In contrast to DeleteHard, the operator DeleteSoft removes each selected
element only if it has no unselected dependent elements. That is, in the
above example, the table ORDERS would not be deleted since it is referenced
by the unselected foreign key on O-DETAILS.OID. The result of applying
DeleteSoft for the same input parameters is shown below. Only
O-DETAILS.UnitPrice has been removed.

CREATE TABLE O-DETAILS ( 

DID int PRIMARY KEY, 

OID int REFERENCES ORDERS) 

CREATE TABLE ORDERS (OID int PRIMARY KEY, . . .) 

The operator DeleteSoft is defined below. Instead of extending the selector
to cover the would-be dangling elements, it is restricted to make sure that
no unselected elements are removed. The selector that keeps the elements
that cannot be deleted (cannotBeDeleted) is first obtained by collecting all
elements which the unselected elements depend on. Now, the input selector
is adjusted to eliminate all these undeletable elements. Finally, the
operator ExtractMin is applied, just as in the operator DeleteHard.

operator DeleteSoft(M, selector, dep)

cannotBeDeleted = Reachable(All(M) - selector, dep);
toDelete = selector - cannotBeDeleted;
toKeep = All(M) - toDelete;

return ExtractMin(M, toKeep, dep);
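
The two deletion variants can be replayed on the schema of Fig. 3.2 with a few lines of Python. The sketch below works on node sets only (edges and type nodes are left out), and the node names pk_did, pk_oid, and fk_oid as well as the exact set of dependency arcs are our assumptions about the figure, chosen to be consistent with the description in the text.

```python
def reachable(selector, dep):
    """Nodes reachable from the selector over one or more arcs of dep."""
    successors = {}
    for src, dst in dep:
        successors.setdefault(src, set()).add(dst)
    seen, stack = set(), list(selector)
    while stack:
        for nxt in successors.get(stack.pop(), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def extract_min(selector, dep):
    """Node set of ExtractMin: the selector plus all nodes it depends on."""
    return set(selector) | reachable(selector, dep)


def delete_hard(all_nodes, selector, dep):
    inverted = {(dst, src) for src, dst in dep}
    to_delete = set(selector) | reachable(selector, inverted)
    return extract_min(all_nodes - to_delete, dep)


def delete_soft(all_nodes, selector, dep):
    cannot_be_deleted = reachable(all_nodes - set(selector), dep)
    to_delete = set(selector) - cannot_be_deleted
    return extract_min(all_nodes - to_delete, dep)


# Each pair (a, b) states that a is existentially dependent on b.
dep = {
    ("O-DETAILS.DID", "O-DETAILS"), ("O-DETAILS.OID", "O-DETAILS"),
    ("O-DETAILS.UnitPrice", "O-DETAILS"), ("ORDERS.OID", "ORDERS"),
    ("pk_did", "O-DETAILS.DID"), ("pk_oid", "ORDERS.OID"),
    ("fk_oid", "O-DETAILS.OID"), ("fk_oid", "ORDERS.OID"),
}
all_nodes = {node for arc in dep for node in arc}
selector = {"O-DETAILS.UnitPrice", "ORDERS"}

print(sorted(delete_hard(all_nodes, selector, dep)))
print(sorted(all_nodes - delete_soft(all_nodes, selector, dep)))
```

With these arcs, the hard variant keeps only O-DETAILS with its fields DID and OID and the primary key on DID, while the soft variant removes nothing but O-DETAILS.UnitPrice.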

Table 3.1 summarizes the differences between the operators discussed 
above and illustrates them using a single characteristic example for rela- 
tional schemas. The “hard” version of deletion in schemas is similar to the 
cascading delete of existentially dependent data tuples, which is supported 
by many relational database systems. 



Table 3.1. Comparison of variants of extraction and deletion 



Operator      Example

Extract       Cannot extract a field without the constraints defined
              for the field.

ExtractMin    Can extract a field without the constraints defined for
              the field.

Delete        Cannot delete a constraint defined on a field without
              deleting the field.

DeleteSoft    Can delete a constraint defined on a field without
              deleting the field. Cannot delete fields referenced by
              unselected fields.

DeleteHard    Can delete fields even if they are referenced by
              unselected fields. In this case, dangling references
              would be deleted, too.



3.2.5 Diff 

The Diff operator computes the difference between a model m and another
model m' that is connected to m using a mapping m_m'. Intuitively, the
difference between two models is a sub-model of m that does not participate
in the mapping m_m'. In other words, to obtain the difference we eliminate
from m all elements that do have matching counterparts in the other model.
Thus, we define the operator Diff as shown below:

operator Diff(m, m_m')

return Delete(m, Domain(m_m'));

Similarly to the operators DeleteSoft and DeleteHard, we provide two
additional versions of the Diff operator: DiffSoft and DiffHard.

operator DiffSoft(m, m_m')

return DeleteSoft(m, Domain(m_m'));

operator DiffHard(m, m_m')

return DeleteHard(m, Domain(m_m'));

Notice that given the differencing operators, we could define the deletion
operators as derived operations. For example, the operator Delete could be
defined based on Diff as

operator Delete(m, s)
return Diff(m, Id(s));

3.2.6 Match 

In our prototype, the Match operator takes as input two models of the same 
kind, e.g., two relational schemas, and returns as output a morphism. We 
implemented Match using the Similarity Flooding (SF) algorithm, a graph- 
matching algorithm presented in Chap. 7. The SF algorithm exploits the 
structure of the graphs to be matched and performs especially well for de- 
tecting the differences between two versions of a schema, which is the case in 
our motivating scenario and many other typical metadata applications. 

The SF algorithm takes as input two graphs m1 and m2, and a set of initial
similarity values between the nodes of the graphs, expressed as a weighted
binary relation seed. Each pair (l, r) of seed carries a similarity value
between zero and one. In a fixpoint computation, the algorithm iteratively
propagates the initial similarity of nodes to the surrounding nodes, using
the intuition that neighbors of similar nodes are similar. The output of the
algorithm is another weighted binary relation.

In Sect. 2.2.2 we defined a morphism as a binary relation. To include 
weights in a morphism, we add to it a third attribute Sim that holds a 
similarity value for each pair of nodes. The primitive operators in Sect. 2.3.1 
ignore this extra information. We implement the operator Match as 

operator Match(m1, m2, seed)

multimap = SFJoin(m1, m2, seed);
multimap = Restrict(multimap, m1, m2);
map = FilterBest(multimap);

return (map, multimap);

The operator SFJoin encapsulates the SF algorithm. As explained in
Chap. 7, the multimap returned by the algorithm may contain a large
fraction of the cross product between nodes in m1 and m2, and needs to be
filtered. The operator FilterBest implements the filter suggested in
Chap. 8, which exploits the stable-marriage property. In addition to
filtering, we restrict the result of the SFJoin operator to the nodes that
represent the model elements of m1 and m2 using the operator Restrict
(Sect. 2.3.2). The input morphism seed is typically obtained using another
auxiliary operator NGramMatch(m1, m2), which computes the similarities of
literals in m1 and m2 based on the number of n-grams that they have in
common. Alternatively, seed can be obtained by composition of morphisms. If
seed is omitted, NGramMatch is invoked in SFJoin by default.

The above Match implementation returns both the filtered morphism
map, and the unfiltered multimap. The morphism map can be adjusted by
the engineer using a graphical tool by invoking the operator EditMap on
the outputs of Match, e.g., as map = EditMap(map, multimap). The graphical
tool allows the engineer to inspect all candidate matches suggested in
multimap.

The script used above for implementing the Match operator can be easily 
adapted to call other external schema matchers, which may deploy thesauri, 
analyze schema annotations, mine samples of instance data, reuse previous 
match results, etc., to reduce the manual post-processing effort. 

3.2.7 Merge 

We discuss our implementation of the Merge operator using the example in
Fig. 3.4. On the top, two sample models m1 and m2 get merged into m (the
output morphisms are omitted). The morphism map is depicted using directed
arcs. The direction of each arc establishes a preference between two
model elements; when collapsing the two elements, the target element is kept
in the output m, whereas the source element is discarded. For example, the
attribute PO.OrderDate is kept and ORDER.ODate is discarded, as illustrated
in the figure. Such preferences are not part of the semantics of the Merge
operator (Sect. 2.3.5), but are essential for practical deployment.

The input morphism map contains an extra attribute Dir to hold the
direction of the arcs (-> or <-). Before Merge is executed, a human engineer
has a chance to specify the arc direction in a graphical tool by invoking the
operator EditMap. The output morphisms provide the engineer an auditable
trail of how the elements of the input models have been transformed into
the elements of the output model. For example, although ORDER.ODate
is discarded in m, the morphism m_m1 would tell the engineer that
ORDER.ODate from m1 has become ORDER.OrderDate in m.

The bottom of Fig. 3.4 depicts m1 and m2 as graphs. For brevity, the arc
labels, type edges, and literals are omitted (compare to Fig. 2.4). Node x
corresponds to relation ORDER, x1 denotes ORDER.ODate, etc. The morphism
map is (x, y, <-), (x1, y2, ->), (x2, z1, ->).

To implement the Merge operator, we developed an algorithm called
GraphMerge, which we describe below. Similar to (Buneman et al. 1992;
Pottinger and Bernstein 2003), the algorithm consists of three conceptual
steps: node renaming, graph union, and conflict resolution.

1. In the first step, the graph nodes at the blunt ends of map are renamed
to their targets at sharp ends, in both graphs m1 and m2. The result of
renaming is shown on the bottom left of Fig. 3.4. Nodes y, x1, and x2 of
both graphs have been renamed respectively to x, y2, and z1.

2. In the second step, we do a graph union, i.e., a set union of two sets of
edges, and obtain the graph depicted on the bottom right of the figure.
This graph is not a well-formed model, because the node z1, which used
to represent the attribute CUST.Customer in m2, has now become an
attribute of two different relations, x (ORDER) and z (CUST).

3. Such conflicts are resolved in the third and final step of the GraphMerge
algorithm. The above conflict is eliminated by deleting either the edge
between x and z1, or the edge between z and z1, effectively making
Customer either an attribute of relation CUST or an attribute of relation
ORDER in the merged schema. The choice between the two options is
made by a human engineer.

Step 3 is the costliest step of the algorithm, since it requires human
feedback. To partially automate conflict resolution, we developed the
following heuristic. Observe that in Fig. 3.4 it seems more “natural” to
keep the attribute Customer in relation CUST than to move it to ORDER. To
generalize this observation, we track the origin of each edge in the merged
graph, and assign to each edge a tag, such as +- or o+, which indicates
whether each of the nodes incident at the edge was a source node of map (-),
a target node of map (+), or none of the two (o) (these are the only three
possible cases assuming that source and target nodes of map are disjoint).
For example, the edge (x, z1) obtained by renaming from (x, x2) is tagged
with +-, since x is a target node and x2 is a source node of map.
Analogously, the edge (z, z1) is tagged with o+, since z does not appear in
map at all.

If we knew that o+ edges are always preferred over +- edges, then, in a
conflict, (x, z1) could be eliminated without asking the engineer. We
examined a variety of merge problems in the context of relational schemas,
XML schemas, and SQL views, and established empirically a total order
among all tag variations, which helps resolve many conflicts automatically
in a way that matches human intuition. This order is shown in the middle
right of Fig. 3.4. Intuitively, edges between unchanged nodes (oo) are least
likely to be rejected in a conflict, and thus have the highest priority.
Similarly, edges incident at + seem more likely to be preferred than those
incident at -. Thus, Steps 2 and 3 are realized as follows. First, all edges
in the merged graph are sorted by decreasing priority. Then, iteratively,
each edge is taken off the top of the sorted list and is appended to an
(initially empty) graph G. If appending the edge violates model
consistency, it is rejected. Once all edges have been appended, the engineer
examines the result and the choices made heuristically, and makes any
necessary adjustments. The execution trace of the algorithm is stored in a
log file, which lists the rejected alternatives.

In the above description of the algorithm, we factored out an important
aspect, the ordering of nodes within a parent. To illustrate how we
reestablish a correct order in the merged schema, consider Fig. 3.4. Node y
denoting the relation PO is renamed to x. Thus, when merging this node with
the original x in m1, we move attributes y1 (Amount) and y2 (OrderDate) to
the last positions in the merged schema m. However, OrderDate “overrides”
ODate, the first attribute in relation ORDER, and should remain at the first
position. Hence, in schema m, the resulting order of attributes is OrderDate,
CAddr, Amount.

The GraphMerge algorithm is summarized below: 

Algorithm GraphMerge(m1, m2, map)

M := m1 ∪ m2; L := empty list; G := empty graph
for each edge e in M do
    rename nodes of e using map; assign tag to e; append e to L;
end for
sort edges in L by decreasing tag priority;
maxN := SELECT max(M.N) FROM M;
while L not empty do
    take edge e = (s, p, o, n) off top of L;
    if tag(e) one of {“-o”, “-+”, “--”} then
        n := n + maxN;
        if o is literal then continue loop end if
    end if
    if exists e' = (s, p, o, n') in G then
        replace e' in G by (s, p, o, min{n, n'});
    else if not conflictsWith((s, p, o, n), G) then
        append (s, p, o, n) to G; end if
    end if
end while
return G

The number maxN is obtained as the highest existing value of the ordinal
property N in m1 and m2 (compare Sect. 2.2.1). It is used to move the nodes
hanging off renamed nodes to the last positions. To test for renamed nodes,
we check whether the corresponding edge tag starts with -, i.e., is one of
-o, -+, or --. The literals belonging to such renamed nodes are removed, to
ensure that, e.g., the relation corresponding to node x in the merged graph
of Fig. 3.4 will be named “ORDER” and not “PO”. The function conflictsWith()
checks whether appending a new edge to G causes a conflict.

The GraphMerge algorithm can be used for various kinds of models by
implementing the function conflictsWith() appropriately. In our prototype,
we deploy the algorithm for merging relational schemas, XML schemas, and
SQL views. For example, conflict detection for relational schemas checks that
relations cannot contain relations instead of attributes, or that attributes
cannot be shared among relations, etc.
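
The following Python sketch replays GraphMerge on a simplified version of the example of Fig. 3.4. Several parts are assumptions rather than facts from the prototype: the empirically determined tag order of Fig. 3.4 is replaced by a simple stand-in (each mark scores o = 2, + = 1, - = 0, and an edge is ranked by the sum of its two marks), ties keep their input order, the only conflict check is that a column cannot belong to two relations, and the ordinals and the small set of name literals are chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lit:
    """A literal value, e.g. the name of a relation or attribute."""
    value: str


# Assumed stand-in for the tag order of Fig. 3.4 ("oo" highest, "--" lowest).
RANK = {"o": 2, "+": 1, "-": 0}


def shared_column(edge, merged):
    """Relational conflict check: a column may belong to one relation only."""
    s, p, o = edge
    return p == "column" and any(
        p2 == "column" and o2 == o and s2 != s for (s2, p2, o2) in merged)


def graph_merge(m1, m2, mapping, conflicts_with=shared_column):
    rename = {}
    for left, right, direction in mapping:
        src, dst = (left, right) if direction == "->" else (right, left)
        rename[src] = dst
    targets = set(rename.values())

    def mark(node):
        return "-" if node in rename else "+" if node in targets else "o"

    # Tag every edge by the role of its original end points, then rename them.
    edges = [(mark(s) + mark(o), rename.get(s, s), p, rename.get(o, o), n)
             for (s, p, o, n) in list(m1) + list(m2)]
    max_n = max(n for (_, _, _, _, n) in edges)
    # Stable sort: edges of equal priority keep their input order.
    edges.sort(key=lambda e: RANK[e[0][0]] + RANK[e[0][1]], reverse=True)

    merged, rejected = {}, []          # merged maps (s, p, o) to its ordinal n
    for tag, s, p, o, n in edges:
        if tag[0] == "-":              # edge hangs off a renamed node
            n += max_n                 # move it to the last positions
            if isinstance(o, Lit):
                continue               # drop literals of renamed nodes
        if (s, p, o) in merged:
            merged[(s, p, o)] = min(n, merged[(s, p, o)])
        elif not conflicts_with((s, p, o), merged):
            merged[(s, p, o)] = n
        else:
            rejected.append((s, p, o))
    return merged, rejected


m1 = [("x", "name", Lit("ORDER"), 0), ("x", "column", "x1", 1),
      ("x", "column", "x2", 2), ("x", "column", "x3", 3),
      ("x1", "name", Lit("ODate"), 0)]
m2 = [("y", "name", Lit("PO"), 0), ("y", "column", "y1", 1),
      ("y", "column", "y2", 2), ("y2", "name", Lit("OrderDate"), 0),
      ("z", "name", Lit("CUST"), 0), ("z", "column", "z1", 1)]
mapping = [("x", "y", "<-"), ("x1", "y2", "->"), ("x2", "z1", "->")]

merged, rejected = graph_merge(m1, m2, mapping)
columns_of_x = sorted((n, o) for (s, p, o), n in merged.items()
                      if s == "x" and p == "column")
print([o for _, o in columns_of_x])   # y2 (OrderDate), x3 (CAddr), y1 (Amount)
print(rejected)                       # the edge that lost the conflict
```

Under these assumptions the sketch keeps Customer (z1) in CUST, rejects the edge (x, z1), retains the name “ORDER” for x, and yields the attribute order OrderDate, CAddr, Amount described above. The rejected list plays the role of the log of rejected alternatives.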

The Merge operator is implemented as follows: 

operator Merge(m1, m2, map)

G = GraphMerge(m1, m2, map);
s = SELECT L FROM map WHERE Dir="->" UNION
    SELECT R FROM map WHERE Dir="<-";
m1_G = RestrictDomain(map, All(m1) ∩ s) + Id(All(m1) - s);
m2_G = RestrictDomain(map, All(m2) ∩ s) + Id(All(m2) - s);
(m, G_m) = Copy(G, All(G));

return (m, Invert(m1_G * G_m), Invert(m2_G * G_m));

Recall that Merge must also return morphisms from each of its input
models to its output model. Thus, after applying GraphMerge to obtain the
merged model G, we compute the morphisms m1_G and m2_G. The selector
s contains all source nodes of map. For the example of Fig. 3.4, we obtain
m1_G as union of domain-restricted map, {(x1, y2), (x2, z1)}, which maps
each renamed m1 node to its new name, and the identity morphism on not
renamed nodes, {(x, x), (x3, x3)}. Finally, G is copied to make the node IDs
of the output model m unique, and the morphisms m1_G and m2_G are
composed with G_m, so they range over m instead of G.
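
The construction of the morphisms from the input models to G can be illustrated with a short Python sketch. It assumes that the arcs of map are first normalized into (source, target) pairs according to Dir; the function name input_to_merged is hypothetical, and the subsequent copy of G and the composition with G_m are left out.

```python
def input_to_merged(nodes, mapping):
    """Morphism from the nodes of one input model to the merged graph G.

    Renamed nodes (sources of map) are mapped to their targets; all other
    nodes are mapped to themselves.
    """
    rename = {}
    for left, right, direction in mapping:
        src, dst = (left, right) if direction == "->" else (right, left)
        rename[src] = dst
    return {(n, rename.get(n, n)) for n in nodes}


mapping = [("x", "y", "<-"), ("x1", "y2", "->"), ("x2", "z1", "->")]
m1_G = input_to_merged({"x", "x1", "x2", "x3"}, mapping)
m2_G = input_to_merged({"y", "y1", "y2", "z", "z1"}, mapping)
print(sorted(m1_G))
print(sorted(m2_G))
```

For m1 this reproduces the union of {(x1, y2), (x2, z1)} and {(x, x), (x3, x3)} given above; for m2 the only renamed node is y, which is mapped to x.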

The GraphMerge algorithm does not “invent” new model elements or establish
new relationships between the existing elements. Therefore, the operator
Merge as implemented above cannot reorganize schemas to resolve structural
conflicts. For example, consider two XML schemas, S1 with element FullName
and S2 with elements FirstName and LastName. Merging S1 and S2 should
ideally create a new complex type Name with subordinate elements FirstName
and LastName. Such structural conflicts can be addressed by using
n-way merges, in which intermediate schemas Sj are used for describing the
desired structural transformations.

In Sect. 2.3.5 we postulated two “semantic” conditions that Merge should
satisfy. Our implementation does not automatically ensure that condition
(i) holds. For example, the engineer might decide to “override” a non-null
constraint on an attribute in one schema S1 by a primary key constraint of
the other schema S2, in which case the output model would be less expressive
(i.e., more constrained) than S1. This flexibility is sometimes desirable in
practice.



In this section, we describe the architecture and main features of the
prototype in more detail.¹ Its architecture is shown in Fig. 3.5. A central
component of the architecture is an interpreter that executes scripts. Its
main task is to orchestrate the data flow between the operators. The
interpreter can be run either from the command line, or invoked
programmatically by external applications and tools. The operators can be
defined either by providing a native implementation, or by means of scripts.
For example, a native operator like ReadSQLDDL reads a text document
containing the definition of a relational database and creates its graph
representation, whereas WriteSQLDDL exports the graph back as text.



Schemas and instances in the native format such as XML or SQL DDL
files are stored in the local file system. The models can additionally be
loaded and stored in a (remote) metadata repository. The repository is an
SQL-compatible database. The interpreter communicates with the repository
via JDBC. Two native operators, ReadDb and WriteDb, load and store
arbitrary graphs in the database.

¹ A demo of the prototype is available for download at
http://www-db.stanford.edu/~melnik/mm/rondo/

Native operators are defined in scripts using statements like 

alias ReadSQLDDL [Java class name]; 

Every native operator corresponds to a Java class compliant to a certain 
interface. Other operators that have been implemented natively include 

— all primitive operators of Sect. 3.2, 

— operators that launch GUIs for editing morphisms and selectors, such as 
EditMap or EditSelector, 

— schema translation and conversion operators such as SQL2XSD, and 

— the operators that implement the individual algorithms such as Similarity 
Flooding (SFJoin), GraphMerge, or the string matching operators descri- 
bed below. 

All other operators, such as Range, Match, or Merge, are implemented by 
scripts presented in the previous sections. The specification of the commonly 
used native or derived operators can be grouped in a single script and utilized 
in other scripts using include statements. 

StringMatch provides a hint as to how literal nodes in one graph match those
in another. This string matcher splits a text string into a set of words and
compares the words in the two sets pairwise. In word comparison, we examine
only the common prefix and suffix. Optionally, term frequencies are used to
reduce the impact of common terms in large schemas. In Chap. 7, we give an
example of string matching in Table 7.1. NGramMatch is another, more
efficient, string matching operator. It builds an in-memory inverted n-gram
index over the labels used in each of the models. Then, both indexes are
merged, producing a list of pairs of labels. The complexity of NGramMatch is
O(n log n) instead of the O(n^2) of StringMatch; it is determined by the
sorting phase of the index construction.
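
The idea behind an n-gram matcher can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the NGramMatch implementation of the prototype: the function names, the Dice-style score, the threshold, and the use of hash maps (rather than a sorted index, which is what gives the O(n log n) bound mentioned above) are all choices made for this sketch.

```python
from collections import defaultdict

def ngrams(label, n=3):
    """Return the set of lowercase character n-grams of a label."""
    s = label.lower()
    if len(s) < n:
        return {s}
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def build_index(labels, n=3):
    """Inverted index: n-gram -> set of labels that contain it."""
    index = defaultdict(set)
    for label in labels:
        for gram in ngrams(label, n):
            index[gram].add(label)
    return index

def ngram_match(labels1, labels2, n=3, threshold=0.3):
    """Merge two inverted indexes and score the candidate label pairs."""
    idx1, idx2 = build_index(labels1, n), build_index(labels2, n)
    shared = defaultdict(int)
    # Only n-grams present in both indexes can produce candidate pairs.
    for gram in idx1.keys() & idx2.keys():
        for a in idx1[gram]:
            for b in idx2[gram]:
                shared[(a, b)] += 1
    result = []
    for (a, b), common in shared.items():
        # Dice coefficient over n-gram sets (an assumption of this sketch).
        score = 2 * common / (len(ngrams(a, n)) + len(ngrams(b, n)))
        if score >= threshold:
            result.append((a, b, round(score, 2)))
    return sorted(result, key=lambda t: (-t[2], t[0], t[1]))

pairs = ngram_match(["DeptName", "EmpFName"], ["name", "fname", "cid"])
```

Labels that share no n-gram never meet in the merge step, which is why the pairwise comparison of all labels is avoided.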

The scripting language that we use is quite simple. Every operator takes
a list of models as input and produces a list of models as output. Load/store
and import/export operators are an exception, since they accept additional
parameters that are not models. Recall that mappings are models and can
therefore be used wherever a model is expected as a parameter. For
compactness of scripts, operators can be nested.

The interpreter provides a debugging facility that allows examining the 
execution traces of complex scripts, and supports flexible handling of the 
input and output parameters of operators. For example, if an operator returns 
more than one argument (as does our implementation of the operator Match), 
some of which are not used subsequently (as in script PropagateChanges in 
Sect. 2.1), they can be tacitly ignored. 

For minimizing the amount of GUI programming needed for visualizing
various kinds of models, we used the following technique. We require an
operator like WriteSQLDDL to output not only the textual representation of
the model, but also a data structure that describes how the terms in the text
relate to the model elements, or graph nodes. In this way the schema elements
shown in Fig. 3.7 enclosed in boxes are associated with the graph nodes
representing those elements, and the GUI operators EditMap and EditSelector
can be used in exactly the same way for relational schemas (Fig. 3.7) or SQL
views (Fig. 3.8).

At the current stage, our prototype supports the basic features of SQL 
DDL, XML Schema, RDF Schema, and SQL views, and, in preliminary form, 
UML. To introduce a new modeling language in the prototype, two steps 
are required. First, the import/export operators need to be provided, which 
ensure lossless round-tripping from the native format to graphs and back. 
Second, several callbacks need to be implemented for supporting the operators 
All, Extract, and GraphMerge. 

The code breakdown of the prototype is shown in Fig. 3.6. A large share 
of the implementation effort was due to the graph APIs responsible for in- 
memory representation and manipulation of graphs and morphisms, and the 
database support. The key generic model-management functionality compri- 
ses less than 7K lines of code. It includes the interpreter (2050), primitive 
operators (660), SFJoin (1760) and GraphMerge (700) implementations, as 
well as the generic GUI operators (1400). The non-generic part is essentially 
divided among the code needed to support SQL DDL, XML schemas, and 
SQL views. The smallest portion of code is due to converters: XSD2SQL 
(260), SQL2XSD (250), View2Morphism (90), and Morphism2View (200). 
The compactness of the converters is mostly due to the fact that they ope- 
rate on the internal graph representation using expressive queries. The total 
amount of code in the prototype is below 24K lines. The total scripting code 
developed so far is measured in hundreds of lines. 

The implemented scenarios run in a few seconds on a 600 MHz laptop 
with 256 MB of memory for moderately-sized schemas, which contain up to 
a few hundred model elements 2 . However, we found that our graphical user 
interface is inadequate for visualizing medium-size and large schemas. For 
example, schemas that contain around 40 table definitions stretch over dozens 
of computer screens, are hard to navigate, and the mappings between them 
clutter the screen. For schemas containing over a thousand table definitions, 
the running time performance of our GUI and the matching algorithm is 
inadequate. Intelligent GUI design and efficiency require future work. 

Further scenarios that we implemented include a reintegration scenario
from the context of version management, iterative merge, a warehousing
scenario, in which we extract a subset of the schema that is sufficient to
answer a given set of queries, and a view reuse scenario. The view reuse
scenario is presented in Sect. 3.4. Among other aspects, it illustrates how
views can be merged, presents the GUIs used in our prototype, and
demonstrates the use of the operators Morphism2View and View2Morphism. The
reintegration scenario is covered in Sect. 3.5.

2 The test schemas that we used are available at ifr.sap.com, www.xcbl.org, and
www.microsoft.com/biztalk.

Fig. 3.6. Code size breakdown in



3.4 View-Reuse Scenario 

In this section, we examine another scenario, which illustrates the use of the
operators presented in this chapter for addressing a typical data warehousing
task. Consider adding a new source S2 to a data warehouse D. Assume that
S2 is similar to an existing source S1. The morphism S1_S2 between the
two source schemas is shown in Fig. 3.7. Let an existing SQL view vS1_D
describe how the instances of S1 populate D. The view vS1_D is depicted in
the middle of Fig. 3.8 (the relevant portion of the warehouse schema can be
seen in the CREATE VIEW clause). Our goal is to reuse the view vS1_D for
importing S2 data into D, i.e., creating the view vS2_D. Conventionally, this
problem is solved manually, involving a tiresome and error-prone renaming
of the attribute and relation names used in vS1_D based on the similarities
between S1 and S2. In our prototype, we obtain vS2_D using the following
script:



1. S1_S2 = Match(S1, S2);
2. S1_D = View2Morphism(vS1_D);
3. S2_D = Invert(S1_S2) * S1_D;
4. vS2_D' = Morphism2View(S2_D);
5. map = Match(vS2_D', vS1_D, Invert(S1_S2));
6. vS2_D = Merge(vS2_D', vS1_D, map + S1_S2);

Fig. 3.7. Morphism between sources S1 and S2 (screenshot of the Mapping
Editor, “Match sources”)

First, we match S1 and S2 to determine the correspondences between the
schemas. As can be seen in Fig. 3.7, some of the elements of S1 and S2 remain
unmatched, whereas others, such as Department.DeptName, are matched to
two elements, Companies.name and Companies.legalEntity. In Step 2, we
extract the morphism S1_D from the view definition vS1_D using a non-generic
operator View2Morphism. For example, the morphism S1_D, which is omitted
in the figures for brevity, associates the attribute Personnel.Pname with
two attributes, Employee.EmpFName and Employee.EmpLName, etc. Next,
we compute the morphism S2_D by composition. In Step 4, a ‘template’ view
definition vS2_D' is generated from S2_D using another non-generic operator
Morphism2View. It is shown on the left of Fig. 3.8. Morphism S2_D contains
no information as to how the values of the attribute Personnel.Affiliation
are obtained from Companies.name and Companies.legalEntity. Therefore, a
functional term fct1 is generated in vS2_D' as a placeholder.
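
Step 3 of the script can be illustrated with a minimal Python sketch in which a morphism is simply a set of arcs (pairs of element names). The helper names invert and compose are hypothetical stand-ins for the operators Invert and *, and the arcs marked as assumed are illustrative guesses that do not come from the text.

```python
def invert(morphism):
    """Invert a morphism given as a set of (left, right) arcs."""
    return {(b, a) for a, b in morphism}

def compose(first, second):
    """Follow an arc of `first`, then an arc of `second`."""
    return {(a, c) for a, b in first for b2, c in second if b == b2}

S1_S2 = {
    ("Department.DeptName", "Companies.name"),
    ("Department.DeptName", "Companies.legalEntity"),
    ("Employee.EmpFName", "Consultants.fname"),  # assumed for illustration
    ("Employee.EmpLName", "Consultants.lname"),  # assumed for illustration
}
S1_D = {
    ("Employee.EmpFName", "Personnel.Pname"),
    ("Employee.EmpLName", "Personnel.Pname"),
    ("Department.DeptName", "Personnel.Affiliation"),  # assumed for illustration
}

# Step 3: S2_D = Invert(S1_S2) * S1_D
S2_D = compose(invert(S1_S2), S1_D)

# Two S2 attributes feed Personnel.Affiliation, but the arcs alone do not say
# how their values are combined; hence the placeholder function in the template.
affiliation_sources = {a for a, b in S2_D if b == "Personnel.Affiliation"}
```

The sketch shows why a morphism is a weak mapping representation: it records which elements are related, but not the expression that relates them.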



Fig. 3.8. Merging two SQL views (screenshots: Mapping Editor “Match views”,
view definitions, and the resulting view vS2_D)

In Step 5, the template vS2_D' and the existing view vS1_D are matched,
using as a seed the morphism between S1 and S2. The resulting morphism,
after some minor manual corrections, is depicted in Fig. 3.8. Finally, in Step
6 both view definitions are merged to obtain vS2_D, shown on the right.
Notice that the function symbol fct0 has been correctly replaced by the nested
concatenation, whereas fct1 was left as is. The unmatched WHERE clause
was borrowed from vS1_D; the attribute references have, however, been
correctly replaced by Companies.cid and Consultants.cid. To achieve that, the
morphism map passed to Merge is extended to include S1_S2. The heuristic
deployed in the GraphMerge algorithm produces vS2_D fully automatically,
due to the relative simplicity of the input views.



3.5 Reintegration Scenario 

In this section, we illustrate another scenario called reintegration, or 3-way
merge. The reintegration problem arises when a model is modified
independently by several engineers or tools. We focus on the case when there
are two such independent modifications. Assume that model m was changed
independently into m1 by Ann and into m2 by Bob. Our goal is to obtain the
reconciled model m3 that incorporates the changes done by Ann and Bob,
and the mappings m3_m1, m3_m2, and m3_m that describe how the models
m1, m2, and m relate to the reconciled version m3.

Consider the example in Fig. 3.9. The original (relational) schema m
is depicted on the top of the figure. In table ORDERS in schema m,
employees are represented by an opaque identifier. To store employees’
names, Ann creates the table EMPLOYEES and makes ORDERS.EID a
foreign key into the new table. Also, she deletes ORDERS.PONum and
O-DETAILS.UnitPrice and adds PRODUCTS.PDesc. Meanwhile, Bob creates
the table BRANDS and replaces the attribute PRODUCTS.Brand by a
foreign key pointing to the new table. In addition, he adds a new attribute
PRODUCTS.ISIC that holds the classification description of products. He
deletes DETAILS.UnitPrice, just as Ann, and in addition he also deletes
DETAILS.Discount.

One way of obtaining m3 is to simply merge m1 and m2. That is, in the
script shown below, we first match m1 and m2 (line 1) and apply the Merge
operator (line 2). To compute the mapping m3_m, we need to know how m
corresponds to each of m1 and m2. So, we match them in lines 3-4. Now,
each of the compositions m3_m1 * Invert(m_m1) and m3_m2 * Invert(m_m2)
describes a part of the mapping from m3 to m. To obtain m3_m, we combine
both compositions in line 5.

operator ReintegrateFirstCut(m, m1, m2)
1. m1_m2 = Match(m1, m2);
2. (m3, m3_m1, m3_m2) = Merge(m1, m2, m1_m2);
3. m_m1 = Match(m, m1); // or given
4. m_m2 = Match(m, m2); // or given
5. m3_m = m3_m1 * Invert(m_m1) + m3_m2 * Invert(m_m2);
6. return (m3, m3_m, m3_m1, m3_m2);

Fig. 3.9. Reintegration scenario (3-way merge)

The above approach has two major weaknesses. First, we have to apply
the Match operator three times, each potentially requiring expensive human
intervention. In practice, m_m1 and m_m2 could be tracked automatically
by the schema editing tool used by Ann and Bob. Still, matching m1 and
m2 from scratch can be costly. Second, the above script discards all deletions
done exclusively by either Ann or Bob. That is, ORDERS.PONum and
O-DETAILS.Discount would appear in m3 although both have been deleted.
O-DETAILS.UnitPrice would, however, be correctly removed.
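
The second weakness can be reproduced with a toy Python sketch. It assumes, purely for illustration, that a model is a flat set of element names, that elements with equal names correspond to each other, and that merging then reduces to set union; the actual Merge operator works on graphs and morphisms. The element lists are abbreviated from the example, and the table name DETAILS is used throughout for brevity.

```python
# Original schema m (abbreviated to the elements discussed in the text).
m = {"ORDERS.EID", "ORDERS.PONum", "DETAILS.UnitPrice",
     "DETAILS.Discount", "PRODUCTS.Brand"}

# Ann: deletes PONum and UnitPrice, adds EMPLOYEES and PDesc.
m1 = (m - {"ORDERS.PONum", "DETAILS.UnitPrice"}) | {"EMPLOYEES", "PRODUCTS.PDesc"}

# Bob: deletes UnitPrice and Discount, adds BRANDS and ISIC.
m2 = (m - {"DETAILS.UnitPrice", "DETAILS.Discount"}) | {"BRANDS", "PRODUCTS.ISIC"}

# First cut: merge m1 and m2 directly (set union in this simplified setting).
m3_first_cut = m1 | m2

# Elements deleted by at least one engineer that nevertheless survive the merge.
deleted_by_someone = (m - m1) | (m - m2)
resurrected = deleted_by_someone & m3_first_cut
```

Only the element deleted by both Ann and Bob disappears; each deletion made by just one of them is undone by the other model's copy of the element.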

To address the first problem, we could modify the above script by
moving lines 3-4 to the top and obtaining m1_m2 as the composition
m1_m2 = Invert(m_m1) * m_m2. By doing so, however, we duplicate the
equivalent additions done by both Ann and Bob, since the added equivalent
elements have no counterparts in m and hence their correspondences
get lost upon composition. That is, after executing such a modified script,
ORDERS.Rebate would appear in m3 twice. And yet, we could use m1_m2
computed by composition to drive the match between m1 and m2, as in
m1_m2 = Match(m1, m2, Invert(m_m1) * m_m2 + NGramMatch(m1, m2)).
Moreover, when m1 and m2 are large, it may be more effective to extract
only the new portions of m1 and m2 and match those.

To address the second problem, which is due to losing deletions done
exclusively by Ann or Bob, we could apply to m1 all deletions done in m2,
and likewise apply to m2 all deletions of m1. We incorporate both ideas in
the script below:

operator Reintegrate(m, m1, m2)
1. m_m1 = Match(m, m1); // or given
2. m_m2 = Match(m, m2); // or given
3. (m1', m1_m1') = Delete(m1, Traverse(All(m) - Domain(m_m2), m_m1));
4. (m2', m2_m2') = Delete(m2, Traverse(All(m) - Domain(m_m1), m_m2));
5. (m1x, m1'_m1x) = Extract(m1', Traverse(All(m1) - Range(m_m1), m1_m1'));
6. (m2x, m2'_m2x) = Extract(m2', Traverse(All(m2) - Range(m_m2), m2_m2'));
7. m1x_m2x_core = Invert(m1'_m1x) * Invert(m1_m1') * Invert(m_m1) *
   m_m2 * m2_m2' * m2'_m2x;
8. m1x_m2x = Match(m1x, m2x, m1x_m2x_core + NGramMatch(m1x, m2x));
9. m1'_m2' = m1'_m1x * m1x_m2x * Invert(m2'_m2x) +
   Invert(m1_m1') * Invert(m_m1) * m_m2 * m2_m2';
10. (m3, m3_m1', m3_m2') = Merge(m1', m2', m1'_m2');
11. m3_m1 = m3_m1' * Invert(m1_m1');
12. m3_m2 = m3_m2' * Invert(m2_m2');
13. m3_m = m3_m1 * Invert(m_m1) + m3_m2 * Invert(m_m2);
14. return (m3, m3_m, m3_m1, m3_m2);

To illustrate the script, consider the schematic representation in Fig. 3.10.
In line 3, we obtain the model m1' that contains all of m1, i.e., the model
produced by Ann, without the elements deleted by Bob by way of m2
(DETAILS.Discount). The expression All(m) - Domain(m_m2) produces a
selector that holds the elements of m that do not appear in m2. The images
of these elements obtained by traversing m_m1 into m1 are then deleted.
Analogously, m2' contains all of m2 without the elements deleted by way of
m1, such as ORDERS.PONum.
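
Lines 3-4 of the script can be mimicked in Python under simplifying assumptions: models are flat sets of element names, morphisms are sets of arcs, the morphisms m_m1 and m_m2 are identity arcs on the retained elements (as if tracked by the editing tool), and Delete removes only the selected elements without handling structural dependents. The helper names domain and traverse are hypothetical stand-ins for the operators Domain and Traverse.

```python
def domain(morphism):
    """Left-hand elements that participate in the morphism."""
    return {a for a, _ in morphism}

def traverse(selector, morphism):
    """Images of the selected elements under the morphism."""
    return {b for a, b in morphism if a in selector}

m = {"ORDERS.EID", "ORDERS.PONum", "DETAILS.UnitPrice",
     "DETAILS.Discount", "PRODUCTS.Brand"}
m1 = (m - {"ORDERS.PONum", "DETAILS.UnitPrice"}) | {"EMPLOYEES", "PRODUCTS.PDesc"}
m2 = (m - {"DETAILS.UnitPrice", "DETAILS.Discount"}) | {"BRANDS", "PRODUCTS.ISIC"}

# Assumed: correspondences recorded as identity arcs on the retained elements.
m_m1 = {(e, e) for e in m & m1}
m_m2 = {(e, e) for e in m & m2}

# Line 3: delete from m1 the images of the elements of m missing in m2.
deleted_by_bob = m - domain(m_m2)              # All(m) - Domain(m_m2)
m1_prime = m1 - traverse(deleted_by_bob, m_m1)

# Line 4: delete from m2 the images of the elements of m missing in m1.
deleted_by_ann = m - domain(m_m1)              # All(m) - Domain(m_m1)
m2_prime = m2 - traverse(deleted_by_ann, m_m2)

# With name-based correspondences, merging reduces to a union.
m3 = m1_prime | m2_prime
```

In this simplified setting, none of the elements deleted by Ann or Bob reappears in m3, while all additions of both engineers are preserved.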

In line 5, we extract a portion m1x of m1' that comprises only the
elements added by Ann (e.g., PRODUCTS.PDesc) and their support
elements (e.g., PRODUCTS). We achieve this by traversing the added
elements All(m1) - Range(m_m1) from m1 to m1'. Line 6 does a similar job for
m2x. Notice that line 5 could be realized alternatively as (m1x, m1'_m1x) =
Diff(m1', Invert(m1_m1') * Invert(m_m1));

In line 7, we compute the mapping m1x_m2x_core between m1x and m2x
to establish the correspondences between the support elements of m1x and
m2x. This mapping is then used to drive the Match between the added
portions in line 8. Here, the engineer executing the script has a chance to
decide whether or not ORDERS.Rebate added by Ann is equivalent to
ORDERS.Rebate added by Bob. Notice that this Match is relatively inexpensive
to perform, since we only have to reconcile the additions introduced by Ann
and Bob.

Fig. 3.10. Schematic representation of the reintegration scenario

In line 9, we compute the mapping between m1' and m2' to drive the
Merge in line 10. To compose m1'_m2', we need to consider both “paths”
between the two models. One of them includes the matches between the
added elements, m1x_m2x, and the other goes over the original model m.
Similarly, the mapping m3_m is obtained in line 13 by joining two paths, one
going through m1 and the other through m2, portions of which are computed
in lines 11-12. In line 14, the results of the script execution are returned.



3.6 Conclusions 

In Chapters 2 and 3 we presented a programming platform for model manage- 
ment that implements all generic operators suggested so far in the literature. 
We explored the use of morphisms and selectors and introduced several no- 
vel generic operators. We discussed the structural operator semantics and 
the algorithms that we developed for implementing them. We showed that 
introducing a new model type like SQL DDL schemas in our prototype re- 
quires a moderate programming effort, but brings a large new class of model- 
management tasks within reach. 

The main conclusions that we draw at this point are the following: 

1. One can solve practical problems using the model management operators. 

2. The solutions require a relatively small amount of code. 

3. One can get quite far using a relatively weak representation for models 
and mappings. 

The operator definitions presented in Chap. 2 above are mostly syntactic,
just like the conceptual structures, and are expressed as graph
transformations. Focusing on syntax allows the operators like Match or Merge
to be implemented in a generic fashion for different kinds of models, as we
demonstrated
in Chap. 3. However, a deeper understanding of the semantics of these ope- 
rators is crucial for assessing the correctness of model-management scripts. 
That is, the effect of applying “syntactic” operators to schemas ultimately 
needs to be expressed in terms of what these operators do to the instances 
of these schemas. Conditions (i)-(iv) for the Extract operator (Sect. 2.3.3), 
or (i)-(ii) for Merge (Sect. 2.3.5) reflect the semantics of these operators to a 
limited degree. 

In Part II, we discuss the instance-level semantics of the model- 
management operators. It allows us to characterize the properties of the 
operators without assuming a particular meta-meta model representation of 
models and mappings. 



A Semantics for Model Management Operators




4. State-Based Semantics 



“He who loves practice without theory is like the sailor who bo- 
ards ship without a rudder and compass and never knows where he 
may cast.” 

- Leonardo da Vinci (1452-1519) 



In Part I, we described the first prototype for model management, called 
Rondo, which offers a set of high-level operators for solving metadata-related 
problems. Using Rondo, we developed scripts for several practically relevant 
scenarios, such as change propagation, view reuse, and reintegration. We fo- 
und that the scripts produce intuitively correct results and that the structural 
operator definitions that we give are useful for solving practical problems. 

In Chap. 2, we defined the semantics of the model-management operators 
for morphisms, a very simple mapping language. A morphism is represented 
as a set of arcs connecting the elements of two schemas. Although the desired 
result of the operators seems intuitively clear when morphisms are utilized, 
the treatment in Chap. 2 provides little guidance with respect to what results 
the operators should return if SQL views, XQuery, Datalog, or other more 
expressive languages are deployed in scripts instead of morphisms. 

In this chapter, we present a way of defining the semantics of the ope- 
rators in a truly generic fashion, without assuming any specific model and 
mapping languages. The main idea of our approach is to express the effect of 
applying the operators to models in terms of what the operators do to the in- 
stances of these models. For example, the effect of applying the operators to a 
database schema is expressed in terms of the valid database states described 
by the schema. In this way, we can characterize the semantics of operators 
without relying on any particular meta-model or meta-meta model. We call 
this kind of semantics state-based , or instance-based , semantics. In contrast, 
the semantics defined in Chap. 2 is driven by the structural properties of 
models, i.e., by the relationships between the individual model elements. To
distinguish the state-based definitions from the structural definitions, in this 
chapter we use a distinct font face for the operators. Thus, we write Extract 
instead of Extract, and denote the composition using o instead of *. 

S. Melnik: Generic Model Management, LNCS 2967, pp. 55-89, 2004.
© Springer-Verlag Berlin Heidelberg 2004

This chapter is structured as follows:

— In Sect. 4.1, we introduce the state-based approach and formal nota- 
tion used in this chapter. We define the state-based semantics of model- 
management scripts and explain what it means to execute a script. 

— In Sect. 4.2, we specify the state-based semantics of the operators. We 
derive alternative formulations of operator definitions that are substantially 
easier to verify for concrete schema and mapping languages. We present 
detailed examples of how the operators can be applied to relational schemas 
and SQL views. 

— In Sect. 4.3, we consider in more detail the problem of computing the 
results of a script, which we refer to as materialization. 

In Chap. 5, we revisit the change propagation scenario from the state- 
based perspective, and address the relationship between the structural and 
state-based operator definitions in Chap. 6. 

Specifying the state-based semantics of the operators allows us to lay out 
a clear extensibility path for supporting more complex mapping languages in 
our prototype. Furthermore, it helps us study formally the properties of the 
operators and the behavior of model-management systems: the existing ones, 
such as Rondo, as well as systems that will be built in the future. 



4.1 Basic Concepts 

In this section we present the concepts of a model and a mapping and explain 
the notation used in the rest of the chapter. For clarity, in the examples that 
we give we put schema and mapping definitions in French quotation marks 
«. . .». For example, «R(A,B), S(C)» denotes a relational schema with two 
tables, R and S. Furthermore, we distinguish between the set semantics for 
relational tables, when the table is not allowed to contain duplicate tuples, 
and the multiset semantics used in SQL. For the former we write «R(A,B)», 
for the latter we use square brackets: «R[A,B]». We abbreviate SELECT 
DISTINCT clauses in view definitions as SELECTD. 

4.1.1 Models 

A model is a formal description of an application artifact, such as a relational
schema, a workflow definition, or an interface specification. Typically, a model
serves as a template for instances. For example, an instance of a relational
schema is a valid database state; an instance of a workflow is a valid transition
graph; an instance of a programming interface is an implementation that
conforms to the interface. Let Inst(m) denote the set of all possible instances
of m. If m is a database schema, every instance db ∈ Inst(m) must satisfy the
constraints present in m.

Definition 4.1.1 (State-based semantics of models). The state-based
semantics of model m is defined as the set of all possible instances of m,
denoted as Inst(m). ■

We do not specify the nature of instances of models any further. 

Example 4.1.1. Let m be the relational schema «R(Name: char(3), Sex:
bool)». Several instances of this schema are shown in Fig. 4.1. Attributes
Name and Sex can take |char(3)| and |bool| different values, respectively.
Using these we can construct |char(3)| · |bool| different tuples to populate
the table R. Any subset of these tuples describes a valid database state of m.
Thus, schema m has 2^(|char(3)| · |bool|) valid instances. Notice that an
instance of m is not an individual string or Boolean value, but an entire
populated database. ■
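
The counting argument can be checked by brute force on a scaled-down schema. The sketch below is an illustration only: it replaces char(3) by a two-value toy domain (an assumption made to keep the enumeration small) and enumerates every subset of the possible tuples.

```python
from itertools import chain, combinations, product

def instances(*domains):
    """Enumerate all set-semantics instances of one table over the given domains."""
    tuples = list(product(*domains))
    subsets = chain.from_iterable(
        combinations(tuples, k) for k in range(len(tuples) + 1))
    return [frozenset(s) for s in subsets]

names = ["a", "b"]      # toy stand-in for the char(3) domain
sexes = [True, False]   # the bool domain

all_instances = instances(names, sexes)
expected = 2 ** (len(names) * len(sexes))
```

With two names and two Boolean values there are four possible tuples and hence sixteen database states, including the empty table.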



Fig. 4.1. Some instances of relational schema
R(Name: char(3), Sex: bool)



Example 4.1.2. Consider the schema «R[A: bool]» (multiset semantics). 
Each ordered list of Boolean values is a valid instance of the schema. Thus, 
the schema has infinitely many instances. ■ 

Definition 4.1.1 intentionally leaves the concept of model unspecified.
Other semantics, e.g., the structural semantics defined in Chap. 2, which
establishes the connection to the representation of m in a concrete
meta-meta model, can be introduced using different function symbols, such as
Struct(m). Model m itself is not identical to Inst(m). Instead, Inst(m)
provides a mechanism for describing a part of the semantics of m, its
state-based semantics.

A model m can itself be an instance of another model. The latter is called 
the meta-model of m. For example, a relational schema can be viewed as an 
instance of a meta-model that describes the concepts Table, Column, Data- 
type, ReferentialConstraint, etc. Such a meta-model for relational schemas 
can be defined, e.g., using a UML diagram (Bernstein et al. 1999).

4.1.2 Mappings

A mapping establishes a semantic correspondence between models. A mapping
between m1 and m2 identifies all mutually consistent states of m1 and
m2, i.e., those states that can exist at the same time in an application that
deploys the mapping. For example, consider an application that uses a program
m1_m2 to generate XML reports complying with DTD m2 from a database
with schema m1. The state-based semantics of m1_m2 tells us whether a
given database state x ∈ m1 and a given document y ∈ m2 are mutually
consistent, i.e., whether y could possibly have been generated from x. Thus,
the program defines a binary relation on the instances of m1 and m2.

Definition 4.1.2 (State-based semantics of mappings). The state-based
semantics of a mapping m1_m2 between models m1 and m2 is defined
as the binary relation Inst(m1_m2) ⊆ Inst(m1) × Inst(m2). ■

Example 4.1.3. Let

m1 = «R(ID, Age)»,
m2 = «S(SSN)»

be relational schemas. Let the mapping map be defined using a relational
algebra expression as

map = «π_ID(σ_Age=20(R)) = S»

Formally, (db1, db2) ∈ Inst(map) if and only if π_ID(σ_Age=20(db1.R)) = db2.S.
A portion of map is shown in Fig. 4.2. Let mapping map' be specified as an
SQL view definition,

map' = «CREATE VIEW S(SSN) AS
SELECTD ID AS SSN FROM R WHERE Age=20»

Fig. 4.2. Portion of a mapping between Inst(«R(ID, Age)») and Inst(«S(SSN)»)



Mappings map and map' are equivalent, i.e., Inst(map) = Inst(map').
Although map and map' are expressed in different languages, they both
describe the same correspondence between m1 and m2. Now, let mapping map''
be specified by the view definition

map'' = «CREATE VIEW R(ID, Age) AS
SELECTD SSN AS ID, 20 AS Age FROM S»

Mappings map' and map'' are not equivalent. To see that, notice that by
applying the view map' to the database state (1,1) ∈ m1 we get ∅ ∈ m2,
but we cannot obtain (1,1) from ∅ using the view map''. That is, ((1,1), ∅) ∈
Inst(map'), but ((1,1), ∅) ∉ Inst(map''). In fact, Inst(map'') ⊂ Inst(map').
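
The difference between the two view-based mappings of Example 4.1.3 can be verified mechanically on a small universe. The following Python sketch is an illustration under assumptions: the predicate names are hypothetical, a state of R is a set of (ID, Age) pairs, a state of S is a set of SSN values, and the attribute domains are restricted to two values each so that all states can be enumerated.

```python
from itertools import combinations, product

def in_map_prime(r, s):
    """(r, s) is in Inst(map'): S holds the IDs of the 20-year-olds in R."""
    return {rid for rid, age in r if age == 20} == s

def in_map_dprime(r, s):
    """(r, s) is in Inst(map''): R holds exactly one tuple (ssn, 20) per SSN in S."""
    return r == {(ssn, 20) for ssn in s}

def powerset(items):
    items = list(items)
    return [set(c) for k in range(len(items) + 1) for c in combinations(items, k)]

r_states = powerset(product([1, 2], [1, 20]))  # toy domains for ID and Age
s_states = powerset([1, 2])                    # toy domain for SSN

inst_prime = [(r, s) for r in r_states for s in s_states if in_map_prime(r, s)]
inst_dprime = [(r, s) for r in r_states for s in s_states if in_map_dprime(r, s)]

# Every pair admitted by map'' is also admitted by map', but not vice versa.
is_subset = all(pair in inst_prime for pair in inst_dprime)
```

On this universe, map' relates each of the 16 states of R to exactly one state of S, whereas map'' admits only the 4 pairs in which R contains nothing but 20-year-olds.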



Example 4.1.4. Let

m1 = «R(ID, Age)»,
m2 = «S(SSN, Tel)»,
map = «π_ID(R) = π_SSN(S)»

That is, (db1, db2) ∈ Inst(map) iff π_ID(db1.R) = π_SSN(db2.S). The mapping
establishes a correspondence between two databases db1 ∈ Inst(m1) and db2 ∈
Inst(m2) whenever they agree on the values of ID and SSN. This mapping
cannot be expressed using a SQL view definition because it is non-functional
and not injective; an instance of m1 does not determine uniquely an instance
of m2, nor the other way around. ■

A mapping can be thought of as a set of constraints that hold between two
models. If Inst(m1_m2) yields the whole cross-product Inst(m1) × Inst(m2),
the relationship between the models is unconstrained, i.e., each pair of states
x ∈ m1, y ∈ m2 is mutually consistent. For example, think of a mapping
between a university payroll database schema and an airline reservation
database schema. Such a mapping is likely to be unconstrained. If
Inst(m1_m2) = ∅, the mapping can be viewed as a contradictory set of
constraints. In general, Inst(m1_m2) is an arbitrary relationship between
instances. It may have several properties whose definitions are summarized
below:

Definition 4.1.3. Let X and Y be two sets. A relation r ⊆ X × Y is
functional, if (x, y1), (x, y2) ∈ r implies y1 = y2; injective, if (x1, y), (x2, y) ∈ r
implies x1 = x2; total, if {x | ∃y : (x, y) ∈ r} = X; surjective (onto), if
{y | ∃x : (x, y) ∈ r} = Y. A total, functional, injective, surjective relation is
called a bijection. ■
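
For finite relations, the properties of Definition 4.1.3 translate directly into executable checks. The following Python sketch is a minimal illustration; the function names and the sample relations are assumptions of this sketch, and a relation is represented as a set of pairs.

```python
def is_functional(r):
    """No left element is related to two different right elements."""
    return all(y1 == y2 for x1, y1 in r for x2, y2 in r if x1 == x2)

def is_injective(r):
    """No right element is related to two different left elements."""
    return all(x1 == x2 for x1, y1 in r for x2, y2 in r if y1 == y2)

def is_total(r, xs):
    """Every element of X occurs on the left of some pair."""
    return {x for x, _ in r} == set(xs)

def is_surjective(r, ys):
    """Every element of Y occurs on the right of some pair."""
    return {y for _, y in r} == set(ys)

def is_bijection(r, xs, ys):
    return (is_total(r, xs) and is_functional(r)
            and is_injective(r) and is_surjective(r, ys))

X, Y = {1, 2, 3}, {"a", "b", "c"}
partial = {(1, "a"), (2, "a")}                # functional only
one_to_one = {(1, "a"), (2, "b"), (3, "c")}   # a bijection between X and Y
```

The relation partial is functional but neither injective, total, nor surjective, while one_to_one satisfies all four properties.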

If Inst(m1_m2) is bijective (surjective, functional, etc.), we call the
respective mapping a bijection (surjection, function, etc.).

In the examples that we give in the subsequent sections, we refer to
certain kinds of mappings as database transformations, views, and queries. A
database transformation specifies how an instance of one schema is
transformed into an instance of another schema. That is, a database
transformation is a functional mapping or simply a function. This function
does not need to be total: there may be certain database states on which the
transformation is undefined due to a violation of integrity constraints assumed
by the transformation but not enforced by the schema. A database query is a
database transformation defined for each initial database state, i.e., a total
functional mapping from some model m. We use the terms query and view
synonymously. More precisely, a database view is a named query, whose result
schema, called view schema, can be specified explicitly. This distinction is
unimportant for the purposes of state-based semantics.

In this dissertation, we focus on binary mappings, i.e., those that hold 
between exactly two models. We consider n-ary mappings in Chap. 11 when 
we discuss future work. 

4.1.3 Formal Notation 

In this chapter, we consider only the state-based semantics of models and
mappings, as described by the instantiation function Inst. To simplify the
notation, we henceforth interpret the variables used for models, such as m1
or m2, as set variables, or unary predicate variables. That is, instead of writing
db ∈ Inst(m) and Inst(m1) = Inst(m2), we simply write db ∈ m and m1 = m2.
Similarly, we consider mapping variables, such as m1_m2 or map, as binary
predicate variables and write (db1, db2) ∈ m1_m2 instead of the more verbose
(db1, db2) ∈ Inst(m1_m2). We borrow this notation from second-order
logic, which has variables and quantifiers not only for individuals but also for
subsets of the universe and for n-ary relations. Despite using this simplified
notation, we stress that a model is more than a set of instances - the latter
only characterizes its state-based semantics.

The concrete schema and mapping definitions such as «R(ID, Age)»,
«SELECT A FROM R», or «π_A(R) = S» are interpreted as constant
symbols. Notice that a mapping definition is often closely coupled with the
models it relates. For example, a relational algebra expression or a SQL view
makes sense only when we have the definitions of the schemas it applies to.
That is, a more correct notation for the mapping of Example 4.1.3 would be map =
«π_ID(σ_Age=20(R)) = S :: R(ID, Age) :: S(SSN)», which identifies the “left”
and “right” models precisely. We continue to use the shorthand notation
when the participating models are clear from the context, or abbreviate as
«π_ID(σ_Age=20(R)) = S :: R :: S» where appropriate.

For mapping variables we often use the underscore notation, such 
as m1_m2. By convention, in a constant assignment such as m1_m2 = 
«CREATE VIEW R(A) AS SELECT A FROM S :: R(A) :: S(A,B)», the left 
instances of the mapping are assumed to originate from «R(A)» while the 
right instances originate from «S(A,B)». The create view statement suggests 
that the instances of «R(A)» are obtained as a function of the instances of 
«S(A,B)». That is, m1_m2 above is non-functional, whereas Invert(m1_m2) 
is a function. The underscore notation has no special semantics and in 
particular does not ensure that Domain(m1_m2) ⊆ m1. We state the necessary 
conditions explicitly where needed. 

We describe the application of model-management operators using 
operational notation. For example, we write (mx, m_mx) = Extract(m, map). 
Here, the operator Extract takes two variables as input and produces two 
variables as output. We use this notation instead of the predicate-centric one, 
Extract(mx, m_mx, m, map), because the former is more intuitive. In fact, 
in all operator definitions that we give, the variables can be clearly divided 
into input and output variables; the operator definition places constraints on 
the output variables based on the values of the input variables. Nevertheless, 
formally the operator Extract is a quaternary predicate, Invert is a binary 
predicate, Merge is a senary (6-ary) predicate, etc. 

4.1.4 Semantics of Scripts 

In this section we explain what it means to compute the results of a model- 
management script. A model-management script is a conjunction of formulas 
built of the model-management operators and free variables and constants 
for models and mappings. That is, a script is a logical formula. The scripts 
have declarative semantics, which is defined as the standard model-theoretic 
semantics for logical formulas. Hence, the order of operator “invocations” in 
a script is irrelevant. We call two scripts t1 and t2 equivalent, denoted as 
t1 ≡ t2, when they are logically equivalent formulas. The fact that it may 
be possible to rewrite a script into another equivalent script provides the 
foundation for optimizing the script execution. 

Since a script is a formula, executing the script amounts to finding a 
variable substitution that satisfies it, or makes the script true. Recall that 
the variables in a script range over relational schemas, SQL views, and other 
kinds of models and mappings. To compute the results of the script effectively, 
we construct concrete schema and mapping definitions that make the script 
true. 

In many cases it is impossible to represent the results of scripts exactly 
using existing schema or mapping languages due to the limited expressiveness 
of the languages. For example, if we compose a SQL view with an XQuery or 
with a set of Datalog rules, it may be impossible to describe the result of the 
composition using a closed expression in an existing database transformation 
language, so that a special language may need to be invented to hold the result 
(Shanmugasundaram et al. 2001a). Even if we compose two transformations 
in the same language, the result may not be expressible in that language, or 
may produce an infinite set of formulas (Madhavan and Halevy 2003) . To use 
the result of composition in practical applications, we may have to construct 
a transformation that is equivalent to the one we are looking for except that 
it covers some irrelevant database states. 

The problem of limited expressiveness arises for database schemas, too. 
For example, in (Buneman et al. 1992) the schema language had to be exten- 
ded to make sure that all schemas obtained by merging two input schemas 
can be represented explicitly. In practical applications, it may often be accep- 
table to construct a more expressive schema if some of the schema constraints 
are not representable in the target language. Intuitively, we call a schema 
more expressive if it allows more valid database states. For example, suppose 
that schema «R(A,B), S(B,C); π_B(R) = π_B(S)» is an exact result of a 
script. The schema defines two tables with a schema constraint. Assume that 
we need to deploy this schema with a SQL DBMS. The set semantics can 
be enforced by defining two unique keys over the attributes of each table. 
The constraint π_B(R) = π_B(S) is, however, not expressible in standard 
SQL DDL. (If B is a unique key of S, then a foreign key constraint can 
express that π_B(R) ⊆ π_B(S). If B is not a key of either table, then neither 
π_B(R) ⊆ π_B(S) nor π_B(S) ⊆ π_B(R) is expressible.) Still, it may be acceptable 
to delegate the enforcement of the constraint to the application and use a 
more expressive schema «R(A,B), S(B,C)» that can be defined in SQL DDL. 

In many other cases, multiple schema or mapping definitions 
may satisfy the script. For example, if the variable assignment 
map := «π_{ID}(σ_{Age=20}(R)) = S» makes a script true, so do 
map := «S = π_{ID}(σ_{Age=20}(R))» and 
map := «S = π_{ID}(σ_{Age=20}(π_{ID,Age}(R)))». 
These expressions are equivalent under state-based semantics, but differ with 
respect to their syntactic representation. Human input or tuning parameters 
are required to specify the desired result in such cases, much like a format 
specification is needed to specify whether the floating-point number 1.3 is 
to be printed out as “1.3” or “0.13E1”. The script execution environment, 
such as Rondo, may provide such format specifications implicitly. We do not 
consider them in this chapter. 

The problem of computing the results of scripts effectively appears as one 
of the most challenging and exciting open issues in model management. As 
mentioned above, this problem is very hard even if we consider relatively 
simple languages and just a single operator, such as Merge (Buneman et al. 
1992; Pottinger and Bernstein 2003) or Compose (Madhavan and Halevy 
2003). We address it in more detail in Sect. 4.3 and in Chap. 10. 

4.1.5 Preliminaries 

From now on, we use the notation introduced in Sect. 4.1.3, without the 
instantiation function Inst. 

Definition 4.1.4 (Submodel). A model m' is called a submodel of m if 
all instances of m' are also instances of m, i.e., m' ⊆ m. ■ 

Definition 4.1.5 (Subordinate model). A model m' is called a subordinate 
model of m, denoted as m' ≤ m, if m' has at most as many instances 
as m, i.e., there exists a surjective function from m onto m'. ■ 

If m' ≤ m, we say that m' is equally or less expressive than m, or is 
dominated by m (Hull 1986). 

Definition 4.1.6 (Minimal model). Model mmin is a minimal model of 
the class C = {m1, ..., mk} if mmin ∈ C and mmin ≤ mi for each mi ∈ C. 
Example 4.1.5. Schema 

m1 = «S(Name: char(3), Sex: bool)», 

in which Name is a primary key, is a submodel of 

m2 = «R(Name: char(3), Sex: bool)». 

In general, if m is a database schema, adding a constraint to m yields a 
submodel of m. Both m1 and m2 are subordinate models of 

m3 = «T(Name: char(4))». 

All of m1, m2, and m3 are subordinate models of 

m4 = «U(FN: char(2), LN: char(2))». 

m4 is not a subordinate model of m3, i.e., it describes more database states 
than m3. Indeed, observe that two strings of size ≤ 2 cannot always be 
encoded losslessly in a string of size ≤ 4. For example, concatenations “a” + 
“bc” and “ab” + “c” both yield “abc”. 

Models m1 and m2 are minimal models of the class {m1, m2, m3, m4}. ■ 
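
The counting argument in Example 4.1.5 can be checked by brute force. The following is a minimal sketch, assuming that char(n) denotes strings of length at most n (as the phrase "strings of size ≤ 2" suggests) and using a hypothetical three-letter alphabet so that every value can be enumerated; the names ALPHABET and strings_up_to are illustrative and not part of the original text.

```python
from itertools import product

ALPHABET = "abc"  # assumption: a tiny alphabet so everything can be enumerated

def strings_up_to(n):
    """All strings over ALPHABET of length 0..n (assumed reading of char(n))."""
    return ["".join(p) for k in range(n + 1) for p in product(ALPHABET, repeat=k)]

# Possible tuples of U(FN: char(2), LN: char(2)) and of T(Name: char(4)).
pairs = [(fn, ln) for fn in strings_up_to(2) for ln in strings_up_to(2)]
names = strings_up_to(4)

# Concatenation is not injective: group the pairs by their concatenation and
# keep the strings that are produced by more than one pair.
by_concat = {}
for fn, ln in pairs:
    by_concat.setdefault(fn + ln, []).append((fn, ln))
ambiguous = {s: ps for s, ps in by_concat.items() if len(ps) > 1}

print(len(pairs), len(names))   # number of possible tuples on each side
print(ambiguous["abc"])         # the two pairs from the example
```

Under these assumptions there are more (FN, LN) pairs than names, so no encoding at all, by concatenation or otherwise, can map the tuples of U injectively into the tuples of T.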



Definition 4.1.7 (Equivalence). Models m and m' are equivalent if they 
have identical instance sets, denoted as m = m'. ■ 

Definition 4.1.8 (Equipotence). Models m and m' are equipotent, or 
equally expressive, denoted as m ≅ m', if m has exactly as many instances 
as m', i.e., there exists a bijection between m and m'. ■ 

Example 4.1.6. Schemas 

m1 = «S(A: char(3), B: bool)», 
m2 = «R(Name: char(3), Sex: bool)» 

are equivalent. They are not identical: in the abbreviated notation introduced 
in Sect. 4.1.3, m = m' is a shortcut for Inst(m) = Inst(m'). However, if the 
full notation is used, Inst(m) = Inst(m') is not equivalent to and does not 
imply m = m'. Schema 

m3 = «T(Val: int(1..33686018))» 

is equipotent with m1 and m2 assuming that the characters are drawn from 
an alphabet of size 256; in this case, |char(3)| · |bool| = 33686018 (compare 
Example 4.1.1). ■ 
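
The figure 33686018 can be reproduced with a few lines of arithmetic. The sketch below assumes that char(3) admits strings of length 0 through 3, which is the reading under which the number in Example 4.1.6 comes out; a fixed-length reading would give 2 · 256³ = 33554432 instead. The function name is illustrative.

```python
ALPHABET_SIZE = 256  # alphabet size stated in Example 4.1.6

def count_strings_up_to(n, k=ALPHABET_SIZE):
    """Number of strings of length 0..n over an alphabet of k characters."""
    return sum(k ** length for length in range(n + 1))

char3 = count_strings_up_to(3)   # 1 + 256 + 256**2 + 256**3
bool_values = 2
print(char3 * bool_values)       # |char(3)| * |bool|
```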

We borrow our definitions of equivalence and equipotence from the stan- 
dard set theory. Notice that schema equivalence is defined differently in (Hull 
1986; Miller et al. 1994). Their definition corresponds to that of equipotence 
(Definition 4.1.8). 
4.2 Operators 

In this section we define the state-based semantics of the key model- 
management operators. The signatures and informal descriptions of the ope- 
rators are summarized in Table 4.1. Wherever possible, we illustrate the re- 
sults of operators using examples of concrete schema and mapping languages. 
We also point out when the results cannot be represented exactly in the langu- 
ages that we consider. Some of the examples we give are non-trivial. In these 
cases, we provide the proofs for the propositions stated in the examples. 



Table 4.1. Summary of key model-management operators 

Signature                             Description 
------------------------------------  ------------------------------------------ 
m1_m3 = Compose(m1_m2, m2_m3)         Composes mappings m1_m2 and m2_m3 
      = m1_m2 ∘ m2_m3 

(mx, m_mx) = Extract(m, m_m')         Extracts a subordinate model mx of m that 
                                      participates in mapping m_m' 

(m, m_m1, m_m2) =                     Merges models m1 and m2 using mapping 
  Merge(m1, m2, m1_m2)                m1_m2 

(md, m_md) = Diff(m, m_m')            Returns a subordinate model md of m that 
                                      does not participate in mapping m_m' 

map3 = Confluence(map1, map2)         Combines mappings map1 and map2 into 
     = map1 ⊕ map2                    mapping map3 

m1_m2 = Match(m1, m2)                 Returns a mapping m1_m2 between m1 and m2 

Auxiliary operators 

m1_m2 = m1 × m2                       Returns the “unrestricted” cross-product 
                                      mapping between models m1 and m2 

map = Id(m)                           Returns the identity mapping map for 
                                      model m 

m = Domain(map)                       Returns model m that holds the instances 
                                      in the domain of mapping map 

m = Range(map)                        Returns model m that holds the instances 
                                      in the range of mapping map 

map2 = Invert(map1)                   Swaps the “left” and “right” side of the 
                                      input mapping map1 



We use the following auxiliary operators: 

1. m1 × m2 =df {(x, y) | x ∈ m1 and y ∈ m2} defines the cross-product of 
two models. 

2. Id(m) =df {(x, x) | x ∈ m} is the identity mapping on m. 

3. Domain(map) =df {x | (x, y) ∈ map}. 

4. Range(map) =df {y | (x, y) ∈ map}. 

5. Invert(m1_m2) =df {(y, x) | (x, y) ∈ m1_m2}. The operator Invert is 
discussed in more detail in Sect. 4.2.2. 
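
Under the state-based semantics, the five auxiliary operators are ordinary set operations. The following sketch illustrates them on small finite sets; it assumes that a model is represented as a Python set of opaque instance labels and a mapping as a set of pairs. The function names and sample data are illustrative only.

```python
def cross(m1, m2):
    """m1 x m2: the unrestricted cross-product mapping."""
    return {(x, y) for x in m1 for y in m2}

def identity(m):
    """Id(m): every instance is related to itself."""
    return {(x, x) for x in m}

def domain(mapping):
    """Domain(map): instances that occur on the left side."""
    return {x for x, _ in mapping}

def range_(mapping):
    """Range(map): instances that occur on the right side."""
    return {y for _, y in mapping}

def invert(mapping):
    """Invert(map): swap the left and right sides."""
    return {(y, x) for x, y in mapping}

m1 = {"x1", "x2"}
m2 = {"y1", "y2", "y3"}
m1_m2 = {("x1", "y1"), ("x1", "y2")}   # a non-functional mapping

print(sorted(domain(m1_m2)), sorted(range_(m1_m2)))
print(m1_m2 <= cross(m1, m2))          # every mapping is a subset of m1 x m2
```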
4.2.1 Compose Operator 

To motivate the definition of the Compose operator, consider the following 
example. 

Example 4.2.1. Let m1_m2 be a mapping between a relational schema m1 
and an XML schema m2 used for data exchange, m1_m2 ⊆ m1 × m2. For a 
given database x ∈ m1 the mapping generates an XML document y ∈ m2. 
Assume that the XML schema m2 has been modified into schema m3. Let 
m2_m3 ⊆ m2 × m3 be the mapping between the original and the new XML 
schema. To derive the updated export mapping, we compute the composition 
of m1_m2 and m2_m3, denoted as Compose(m1_m2, m2_m3) or simply as 
m1_m2 ∘ m2_m3. ■ 

Definition 4.2.1 (Compose, ∘). 

m1_m2 ∘ m2_m3 =df {(x, z) | (x, y) ∈ m1_m2 and (y, z) ∈ m2_m3} ■ 
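
On finite instance sets, Definition 4.2.1 is the usual relational composition. The sketch below is one possible rendering, assuming mappings are Python sets of pairs; the sample mappings are invented for illustration. It also spot-checks associativity on one example.

```python
def compose(m1_m2, m2_m3):
    """{(x, z) | (x, y) in m1_m2 and (y, z) in m2_m3 for some y}."""
    return {(x, z) for x, y in m1_m2 for y2, z in m2_m3 if y == y2}

m1_m2 = {("x1", "y1"), ("x2", "y1"), ("x2", "y2")}
m2_m3 = {("y1", "z1"), ("y2", "z2")}
m3_m4 = {("z1", "w"), ("z2", "w")}

m1_m3 = compose(m1_m2, m2_m3)
print(sorted(m1_m3))

# Associativity on this sample: the grouping of compositions does not matter.
left = compose(compose(m1_m2, m2_m3), m3_m4)
right = compose(m1_m2, compose(m2_m3, m3_m4))
print(left == right)
```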

Obviously, m1_m2 ∘ m2_m3 ⊆ m1 × m3. Next, we consider three examples 
of composition of SQL views. In each of the examples, the views have distinct 
directionality: m1 → m2 → m3, m1 → m2 ← m3, or m1 ← m2 → m3. 

Example 4.2.2 (m1 → m2 → m3). Let 

m1 = «R(A,B), S(B,C)», 
m2 = «T(A,C)», 
m3 = «U(A)», 

m1_m2 = «CREATE VIEW T(A, C) AS 
         SELECTD R.A, S.C FROM R, S WHERE R.B=S.B», 
m2_m3 = «CREATE VIEW U(A) AS 
         SELECTD T.A FROM T WHERE T.C=5» 

Then, the composition m1_m3 = m1_m2 ∘ m2_m3 can be specified as 

«CREATE VIEW U(A) AS 
 SELECTD R.A FROM R, S WHERE R.B=S.B AND S.C=5» 

Proof: Observe that m1_m2 = «π_{A,C}(R ⋈ S) = T» and m2_m3 = 
«π_A(σ_{C=5}(T)) = U». That is, (x, y) ∈ m1_m2 iff π_{A,C}(x.R ⋈ x.S) = y.T. 
Similarly, (y, z) ∈ m2_m3 iff π_A(σ_{C=5}(y.T)) = z.U. By Definition 4.2.1, 

m1_m3 = {(x, z) | (x, y) ∈ m1_m2 and (y, z) ∈ m2_m3} 
      = {(x, z) | π_{A,C}(x.R ⋈ x.S) = y.T and π_A(σ_{C=5}(y.T)) = z.U} 
      = {(x, z) | π_A(σ_{C=5}(π_{A,C}(x.R ⋈ x.S))) = z.U} 
      = {(x, z) | π_A(σ_{C=5}(x.R ⋈ x.S)) = z.U}. 

That is, m1_m3 = «π_A(σ_{C=5}(R ⋈ S)) = U», which is equivalent to the 
above view definition. ■ 
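
The equivalence claimed in Example 4.2.2 can be spot-checked on a concrete database state. The sketch below uses Python's built-in sqlite3 module and assumes that SELECTD stands for SELECT DISTINCT; the sample rows are invented, and agreement on one state illustrates the proof but does not replace it.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE R(A INTEGER, B INTEGER);
    CREATE TABLE S(B INTEGER, C INTEGER);
    INSERT INTO R VALUES (1, 10), (2, 20), (3, 30), (1, 20);
    INSERT INTO S VALUES (10, 5), (20, 7), (30, 5), (40, 5);

    -- m1_m2: view T defined over R and S
    CREATE VIEW T(A, C) AS
        SELECT DISTINCT R.A, S.C FROM R, S WHERE R.B = S.B;

    -- m2_m3: view U defined over T
    CREATE VIEW U(A) AS
        SELECT DISTINCT T.A FROM T WHERE T.C = 5;
""")

# Result of applying the two views one after the other.
stepwise = set(con.execute("SELECT A FROM U"))

# Result of the composed view definition given in the example.
composed = set(con.execute(
    "SELECT DISTINCT R.A FROM R, S WHERE R.B = S.B AND S.C = 5"))

print(stepwise == composed)
```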

Example 4.2.3 (m1 → m2 ← m3). Let 

m3 = «T(A,C)», 

m1_m2 = «CREATE VIEW U(A) AS 
         SELECTD R.A FROM R, S 
         WHERE R.B IN (SELECTD S.B FROM S)», 
m2_m3 = «CREATE VIEW U(A) AS 
         SELECTD T.A FROM T WHERE T.C=5» 

(The mapping m2_m3 is described using a view definition that maps m3 to 
m2.) Then, 

m1_m3 = m1_m2 ∘ m2_m3 = «π_A(R ⋈ S) = π_A(σ_{C=5}(T))». 

There is no equivalent view definition for this relational algebra expression 
because the mapping m1_m3 is non-functional and not injective. 

Proof: Observe that m1_m2 = «π_A(R ⋈ S) = U» and m2_m3 = 
«π_A(σ_{C=5}(T)) = U». We obtain the proposition using the same 
arguments as in Example 4.2.2. ■ 

Example 4.2.4 (m1 ← m2 → m3). Let 

m1 = «T(A,C)», 
m2 = «R(A,B), S(B,C)», 
m3 = «U(A)». 

Let |A|, |B|, |C| be domain sizes of attributes A, B, and C. Assume that 
|B| > |A| · |C|. Further, let 

m1_m2 = «CREATE VIEW T(A, C) AS 
         SELECTD R.A, S.C FROM R, S WHERE R.B=S.B», 
m2_m3 = «CREATE VIEW U(A) AS 
         SELECTD R.A FROM R WHERE R.B=3». 

Then, the composition m1_m3 = m1_m2 ∘ m2_m3 is unrestricted, i.e., 
m1_m3 = m1 × m3. Obviously, m1_m3 cannot be represented using a SQL 
view. 

Proof: Observe that m1_m2 = «π_{A,C}(R ⋈ S) = T» and m2_m3 = 
«π_A(σ_{B=3}(R)) = U». Thus, m1_m3 = {(x, z) | π_{A,C}(y.R ⋈ y.S) = x.T 
and π_A(σ_{B=3}(y.R)) = z.U}. Now we show that for each x ∈ m1 and for each 
z ∈ m3 we can find y ∈ m2 so that the above condition holds. Assume that 
database instances x and z are given. We construct the database y using the 
following view definitions: 

CREATE VIEW R(A, B) AS 
  (SELECTD U.A, 3 AS B FROM U) UNION 
  (SELECTD T.A, Sk(T.A, T.C) AS B FROM T) 

CREATE VIEW S(B, C) AS 
  SELECTD Sk(T.A, T.C) AS B, T.C FROM T 

where Sk(.,.) is a Skolem function that generates a distinct value b = Sk(a, 
c) ≠ 3 from the domain of B for each pair of A and C values. Such a Skolem 
function exists, since |B| > |A| · |C|. When this property is not guaranteed, 
we get m1_m3 = {(x, z) | π_{A,C}(y.R ⋈ y.S) = x.T}, i.e., the “right side” of 
the mapping is unconstrained. ■ 

Proposition 4.2.1. Operator Compose is associative, i.e., map1 ∘ (map2 ∘ 
map3) = (map1 ∘ map2) ∘ map3. 

Proof: follows from Definition 4.2.1. ■ 



4.2.2 Invert Operator 

The operator Invert swaps the “left” and “right” side of a mapping. For 
convenience, its definition is repeated below: 

Definition 4.2.2 (Invert). Invert(m1_m2) =df {(y, x) | (x, y) ∈ m1_m2} 



By reversing the sides of a mapping we can ensure that its directionality 
fits other operations, such as composition or merging. An inverted mapping 
still describes the same correspondence between two models. When we use 
mapping constants, we have to specify the “left” and “right” schemas ex- 
plicitly, as we explained in Sect. 4.1.3, to distinguish the mapping from its 
inverted mapping. 

Example 4.2.5. Consider the mapping 

map = «π_{ID}(σ_{Age=20}(R)) = S :: R(ID, Age) :: S(SSN)» 

from Example 4.1.3. The inverted mapping can be represented as 

Invert(map) = «π_{ID}(σ_{Age=20}(R)) = S :: S(SSN) :: R(ID, Age)». ■ 

As mentioned earlier, the state-based semantics does not prescribe the 
exact syntactic representation for models and mappings. For example, we 
cannot state that Invert(map) in the example above should be computed as 
a view definition, and not as a relational algebra expression, and that in 
this view definition the relation R should be defined as a view on S and not 
vice-versa. This constraint is part of the structural semantics. Still, using the 
state-based semantics we could tell that such a view on S cannot possibly 
exist, since it would require Invert(map) to be a function, but the mapping 
map is not injective. 

The following propositions summarize some important well-known alge- 
braic properties that we use in subsequent sections: 



Proposition 4.2.2. Invert(Invert(map)) = map. 

Proposition 4.2.3. Invert(map1 ∘ map2) = Invert(map2) ∘ Invert(map1). 

Proposition 4.2.4. Mapping map is a surjective function onto m if and 
only if Invert(map) ∘ map = Id(m). ■ 
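
Propositions 4.2.2 to 4.2.4 can be exercised on small finite mappings. The sketch below assumes mappings are Python sets of pairs and that "function" means a possibly partial function; the sample data is invented, and a passing check on one sample illustrates the propositions without proving them.

```python
def invert(mapping):
    return {(y, x) for x, y in mapping}

def compose(a, b):
    return {(x, z) for x, y in a for y2, z in b if y == y2}

def identity(m):
    return {(x, x) for x in m}

map1 = {("a", 1), ("b", 1), ("c", 2)}    # a surjective function onto {1, 2}
map2 = {(1, "p"), (2, "p"), (2, "q")}    # not a function: 2 has two images

# Proposition 4.2.2: inverting twice gives the original mapping back.
prop_422 = invert(invert(map1)) == map1

# Proposition 4.2.3: inversion reverses the order of composition.
prop_423 = invert(compose(map1, map2)) == compose(invert(map2), invert(map1))

# Proposition 4.2.4: holds for map1, fails for the non-functional map2.
prop_424_map1 = compose(invert(map1), map1) == identity({1, 2})
prop_424_map2 = compose(invert(map2), map2) == identity({"p", "q"})

print(prop_422, prop_423, prop_424_map1, prop_424_map2)
```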

4.2.3 Extract Operator 

The operator Extract takes a model m and a mapping map between m and 
some model m', and returns a subordinate model mx of m that “participates” 
in the mapping. Before we give a formal definition of Extract, we explain the 
intuition behind this operator using a motivating example. 

Example 4.2.6. Imagine that m is a legacy database schema and q is a query 
over m. Our goal is to upgrade the legacy database by producing a new 
schema mx that captures only the information that can actually be queried 
using q and no other information (see Fig. 4.3). That is, mx is a minimal 
schema that still allows us to obtain all query results that we can obtain by 
running q against m. In addition to the new schema mx, we need a database 
transformation m_mx that tells us how the data of m can be migrated 
to mx. After migrating all instances of m to mx, we can reformulate our 
original query q to run against mx. To do this, we compose the reverse 
transformation Invert(m_mx) and q. Notice that q may yield the same query 
result for multiple different database states of m, because m includes irrelevant 
legacy information. Such database states are indistinguishable under q and are 
“collapsed” into a single database state of mx by way of m_mx. ■ 



Fig. 4.3. Schematic representation for Example 4.2.6 (Extract). Figure labels: 
new schema, view, legacy DB, query; query against new schema: 
Invert(m_mx) ∘ q 



The definition of Extract that we present below describes formally the 
properties of mx and m_mx in the above example. The definition covers a 
general case in which q is an arbitrary, possibly non-functional mapping. 

Definition 4.2.3 (Extract). Let Domain(q) ⊆ m. (mx, m_mx) = 
Extract(m, q) holds if and only if 

i. mx = Range(m_mx). 

ii. m_mx ∘ Invert(m_mx) ∘ q = q. 

iii. mx is a minimal model satisfying (i) and (ii). ■ 
To tie the definition to the motivating example, observe that m_mx is the 
database transformation from m to the new schema mx, while Invert(m_mx) ∘ 
q is the updated query over mx. Hence, condition (ii) requires the updated 
query over mx to produce the same results as the original query q over m. 
Notice that in general, m_mx ∘ Invert(m_mx) is not one-to-one, i.e., condition 
(ii) is not a tautology. 

Definition 4.2.3 specifies a quaternary predicate. Whether it holds or not 
for fixed mx, m_mx, m, q is hard to verify formally because condition (ii) is 
a non-trivial expression and condition (iii) involves a test over all models mx 
that satisfy (i) and (ii), as required by Definition 4.1.6. Therefore, before we 
give detailed examples of Extract, we prove the following theorem that allows 
us to reformulate Definition 4.2.3 into an equivalent set of conditions 
that are substantially easier to verify. In the theorem, we utilize an auxiliary 
predicate ind (for “indistinguishable”): 

ind(y1, y2, m_m') =df ({z1 | (y1, z1) ∈ m_m'} = {z2 | (y2, z2) ∈ m_m'}) 

It holds whenever the “projections” of y1 and y2 are equal. If 
ind(y1, y2, m_m'), we say that y1 and y2 are indistinguishable under m_m'. 
For a fixed mapping, ind(., ., .) is an equivalence relation, i.e., it is reflexive, 
symmetric, and transitive. 

Theorem 4.2.1 (Simplification of Extract). Let Domain(m_m') ⊆ m. 
(mx, m_mx) = Extract(m, m_m') holds if and only if the following 
conditions are satisfied: 

1. mx = Range(m_mx). 

2. Domain(m_mx) = Domain(m_m'). 

3. For all (y1, x1), (y2, x2) ∈ m_mx: x1 = x2 iff ind(y1, y2, m_m'). ■ 

Condition (2) makes sure that exactly those instances of m participate 
in m_mx that are connected in m_m'. Condition (3) requires collapsing any 
two instances y1 and y2 of m into a single instance of mx if and only if y1 
and y2 are indistinguishable under m_m'. The proof of the theorem is in 
Appendix B. 
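
The three conditions of Theorem 4.2.1 suggest a direct construction on finite instance sets: group the participating instances of m into classes of indistinguishable instances and let each class become one instance of mx. The following is a minimal sketch under that reading, not the algorithm of the dissertation. Representing each instance of mx by the common image of its class is an arbitrary encoding chosen here, and the sample mapping is invented (loosely modeled on the situation described for Fig. 4.4).

```python
def image(y, mapping):
    """{z | (y, z) in mapping}: the 'projection' of y under the mapping."""
    return frozenset(z for y2, z in mapping if y2 == y)

def ind(y1, y2, mapping):
    """Indistinguishability: both instances have the same image."""
    return image(y1, mapping) == image(y2, mapping)

def invert(mapping):
    return {(y, x) for x, y in mapping}

def compose(a, b):
    return {(x, z) for x, y in a for y2, z in b if y == y2}

def extract(m, m_mprime):
    """Build (mx, m_mx) satisfying conditions (1)-(3) of Theorem 4.2.1."""
    participating = {y for y in m if image(y, m_mprime)}      # condition (2)
    m_mx = {(y, image(y, m_mprime)) for y in participating}   # condition (3)
    mx = {x for _, x in m_mx}                                 # condition (1)
    return mx, m_mx

m = {"y1", "y2", "y3", "y4", "y5", "y6", "y7"}
q = {("y1", "z1"), ("y2", "z2"), ("y3", "z2"), ("y4", "z4"),
     ("y5", "z5"), ("y5", "z6"), ("y6", "z6")}   # y7 does not participate

mx, m_mx = extract(m, q)
print(len(mx))                                       # number of extracted instances
print(compose(compose(m_mx, invert(m_mx)), q) == q)  # condition (ii) of Def. 4.2.3
```

Here y2 and y3 collapse into one instance of mx, while y5 and y6 stay apart because their images differ.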

The following examples illustrate the operator Extract. 

Example 4.2.7. Fig. 4.4 shows a valid result of applying the operator Extract 
to a model m with seven instances. y2 and y3 are indistinguishable under 
m_m' since they are associated with the same z2, i.e., ind(y2, y3, m_m') holds. 
Therefore, they are collapsed into a single instance x2 of mx. All other 
instances of m are pairwise distinguishable. For example, y5 is associated with 
{z5, z6} and y6 with {z6} so that y5 and y6 are connected with two distinct 
instances of mx. Instance y7 is not connected in m_m' and thus has no 
counterpart in mx. ■ 
Example 4.2.8. Let 

m = «R(A,B), S(C)», 
m' = «T(A,D)», 

m_m' = «CREATE VIEW T(A,D) AS 
        SELECTD A, 5 AS D FROM R 
        :: R(A,B), S(C) :: T(A,D)». 

Then, 

mx = «V(A)», 

m_mx = «CREATE VIEW V(A) AS 
        SELECTD A FROM R 
        :: R(A,B), S(C) :: V(A)» 

is a valid result of extraction. 

Proof: V(A) is a view schema in m_mx so that m_mx is a surjective 
function and condition (1) is satisfied. Condition (2) is trivially true since all 
instances of m participate in m_m' and m_mx. To verify condition (3), note 
that {z | (y, z) ∈ m_m'} describes all instances of «T(A,D)» that can be 
obtained using the view m_m' on the database y ∈ m. The view selects A values 
from R, therefore ind(y1, y2, m_m') holds iff π_A(y1.R) = π_A(y2.R). On the 
other hand, m_mx = «π_A(R) = V», i.e., (y, x) ∈ m_mx iff π_A(y.R) = x.V. 
Hence, for all (y1, x1), (y2, x2) ∈ m_mx we have π_A(y1.R) = x1.V and 
π_A(y2.R) = x2.V, so that ind(y1, y2, m_m') iff x1.V = x2.V iff x1 = x2. Hence, 
condition (3) holds. ■ 

Example 4.2.9. Let 

m = «R(Name: char(10), Salary: real, Year: int)», 
m_m' = «SELECTD Name, SUM(Salary) AS Income 
        FROM R GROUP BY Name». 

Then, 

mx = «S(Name: char(10), Income: real)» 

is a valid result of extraction. Notice that S.Name is defined as a primary 
key. Its uniqueness is guaranteed by the GROUP BY clause in m_m'. ■ 

Example 4.2.10. This example illustrates extraction when m_m' contains a 
join. Let 

m = «R(A,B), S(B,C)», 
m' = «T(A,B,C)», 
m_m' = «T = R ⋈ S». 

Then, 

mx = «P(A,B), Q(B,C); π_B(P) = π_B(Q)», 
m_mx = «P = π_{A,B}(R ⋈ S), Q = π_{B,C}(R ⋈ S)» 

is a valid result of Extract. 

Proof: The proof consists of two parts. First (→), we show that each result of 
the query can be kept by a unique instance of mx. Then (←), we demonstrate 
that each instance of mx can be obtained as a result of the query. 

(→) Notice that the query «T = σ_{B∈β}(R) ⋈ σ_{B∈β}(S), β = π_B(R) ∩ π_B(S)» 
produces the identical result as m_m'. That is, if we first select from R 
and S only the tuples in which B values are shared across R and S, and 
join them, we obtain the same result as by joining R and S directly. Thus, 
we can “shred” any given instance of the result R ⋈ S into P and Q with 
π_B(P) = π_B(Q) and reconstruct it using a join P ⋈ Q without information 
loss. 

(←) Observe that π_{A,B}(P ⋈ Q) = P and π_{B,C}(P ⋈ Q) = Q for each P and Q 
such that π_B(P) = π_B(Q). That is, we can join P and Q into a new table, 
which represents a valid result of the query m_m', and reconstruct P and 
Q again from this new table. 

Together, (→) and (←) yield that mx is equipotent with the set of 
possible results of the query (condition (3)). Moreover, the construction used in 
(←) tells us that m_mx is surjective (condition (1)), and all instances of m 
participate in m_m' and m_mx (condition (2)). ■ 

In the previous examples, m_m' was a total function from m to m'. Next 
we illustrate the case when m_m' is not a function, but instead Invert(m_m') 
is a function from m' into m. 

Example 4.2.11. Let 

m' = «R(A,B), S(B,C)», 
m = «T(A,B,C,D)», 

m_m' = «CREATE VIEW T(A,B,C,D) AS 
        SELECTD A, B, C, 5 
        FROM R, S 
        WHERE R.B=S.B AND S.C=4» 

Notice that Invert(m_m') transforms deterministically each instance of m' 
to an instance of m (instances of m' in which no tuple of S matches the 
WHERE clause are mapped to the same instance of m, the empty relation T). 
Therefore, Invert(m_m') is a total functional mapping. Because of that, for all 
y1, y2 ∈ Domain(m_m'): ind(y1, y2, m_m') iff y1 = y2. Thus, condition (3) of 
Theorem 4.2.1 becomes: for all (y1, x1), (y2, x2) ∈ m_mx: x1 = x2 iff y1 = y2. 
In other words, m_mx and Invert(m_mx) must be injective functions. 

Now, observe that Invert(m_m') is not onto: there exist instances of m 
that are not computable from any instance of m', such as {(0,0,0,0)} ∈ m. 
Condition (2) states that such instances may not participate in m_mx. The 
instances computable from instances of m' are precisely those relations y ∈ m 
in which every tuple t ∈ y satisfies the condition (t.C=4 and t.D=5). Thus, 
the injective function m_mx must assign to each such y ∈ m exactly one 
instance of mx. The mapping 

m_mx = «CREATE VIEW T(A,B,C,D) AS 
        SELECTD A, B, 4, 5 FROM V 
        :: T(A,B,C,D) :: V(A,B)» 

defines such a function. m_mx is a surjective function onto mx = «V(A,B)», 
i.e., for each relation x ∈ «V(A,B)» we can find a source relation y ∈ 
«T(A,B,C,D)» such that the view transforms y into x. Hence, the 
conditions of Theorem 4.2.1 are satisfied. In contrast, specifying m_mx as 

«CREATE VIEW V(A,B) AS 
 SELECTD A, B FROM T 
 :: T(A,B,C,D) :: V(A,B)» 

violates condition (3). ■ 

Theorem 4.2.2. If (mx, m_mx) = Extract(m, m_m'), then m_mx is a 
surjective function. 

Proof: Let (y, x1), (y, x2) ∈ m_mx. Assume that x1 ≠ x2. Then, by 
condition (3) of Theorem 4.2.1, ind(y, y, m_m') is false. This is a contradiction, 
since ind(., ., .) is reflexive. So, our assumption is false, and x1 = x2. Hence, 
m_mx is a function. m_mx is surjective by condition (1). ■ 

Notice that m_mx is in general not total, i.e., it is a database 
transformation, but not necessarily a view. For an illustration, see Example 4.2.10. 

Proposition 4.2.5. If (mx, m_mx) = Extract(m, m_m') and (my, m_my) = 
Extract(m, m_m'), then there exists a bijection between mx and my, namely 
mx_my = Invert(m_mx) ∘ m_my. That is, model mx in Definition 4.2.3 is 
defined uniquely up to isomorphism. ■ 

Proposition 4.2.6. Let (mx, m_mx) = Extract(m, m_m'). If Invert(m_m') 
is surjective, then m_mx is a total surjective function. If Invert(m_m') is 
a surjective function, then m_mx is a bijection, i.e., the extracted model is 
isomorphic to the input model. ■ 






Proposition 4.2.7. Let (mx, m_mx) = Extract(m, m_m'). If m_m' is a 
function, then Invert(m_m') ∘ m_mx is an injective function. If m_m' is a 
surjective function, then Invert(m_m') ∘ m_mx (and its inverse Invert(m_mx) ∘ 
m_m') are bijections. ■ 

4.2.4 Merge Operator 

We explain the intuition behind the Merge operator using the following data 
integration scenario. 

Example 4.2.12. Consider a company with two departments. Each of the 
departments manages its own database. Let m1 and m2 be the respective 
database schemas (see Fig. 4.5). Schemas m1 and m2 are not disjoint; for 
instance, both databases contain employee data. To simplify the management 
of data consistency across the departmental databases, the company decides 
to keep all data in a centralized database, while the departments access the 
data using view schemas m1 and m2. Thus, the goal is to create a global 
schema m for the centralized database such that m is minimal, i.e., it captures 
only the information needed by the departments, and no other information. 

Let the mapping m1_m2 describe how m1 relates to m2, i.e., m1_m2 
identifies all mutually consistent database states x ∈ m1, y ∈ m2. In other 
words, each pair (x, y) ∈ m1_m2 represents a single valid state z ∈ m of 
the centralized database. The states x and y need to be “glued” into z in 
such a way that we can unambiguously reconstruct x and y from z using 
two functional mappings, m_m1 and m_m2. To prevent information loss, the 
centralized database must be able to represent each state of m1 and each 
state of m2 even if they are not mutually consistent (for example, think of 
a temporary inconsistency that may occur when a negative account balance 
indicating a debt in x ∈ m1 is disallowed in the billing schema m2). That is, 
m_m1 and m_m2 are not total; they are database transformations, but not 
views. ■ 



Fig. 4.5. Schematic representation for Example 4.2.12 (Merge). Figure labels: 
global schema, mapping m1_m2 
We define the operator Merge as follows: 

Definition 4.2.4 (Merge). Let m1_m2 ⊆ m1 × m2. (m, m_m1, m_m2) = 
Merge(m1, m2, m1_m2) holds if and only if 

i. m_m1 and m_m2 are surjective functions onto m1 and m2, respectively. 

ii. m1_m2 = Invert(m_m1) ∘ m_m2. 

iii. m = Domain(m_m1) ∪ Domain(m_m2). 

iv. m is a minimal model satisfying (i)-(iii). ■ 

Condition (i) enables us to reconstruct instances x and y from z in a 
unique fashion and ensures that each instance of m1 and each instance of m2 
is representable in m. The effect of condition (ii) is that the instances x and 
y that we obtain using the database transformations m_m1 and m_m2 are, 
if they both exist, mutually consistent. Condition (iii) requires each instance 
z ∈ m to represent a valid state of affairs, which can be attributed either 
to m1 or m2, or both. The minimality condition (iv) prevents Merge from 
“inventing” instances of m that are not absolutely necessary for representing 
all of m1 and m2. 

Example 4.2.13. Fig. 4.6 illustrates the Merge operator. The input mapping 
m1_m2 is shown using light-gray lines. ■ 

Fig. 4.6. Illustration of Merge operator 



Definition 4.2.4 specifies a sextary predicate. Whether it holds or not for 
fixed m, m_mi, m_m2, mi, m2, mi_m,2 is hard to verify formally even for the 
simple example of Fig. 4.6, since condition (iv) involves a test over all models 
m that satisfy (i)-(iii). Therefore, before we give detailed examples of Merge, 
we present the following theorem that allows us to reformulate condition (iv) 
into another condition that is easier to check. Its proof is in Appendix B. 

Theorem 4.2.3 (Simplification of Merge). Let m1_m2 ⊆ m1 × m2.
(m, m_m1, m_m2) = Merge(m1, m2, m1_m2) holds if and only if

- the conditions (i)-(iii) of Definition 4.2.4 are satisfied, and
- |m| = mergeCard(m1, m2, m1_m2), where mergeCard(m1, m2, m1_m2) =df
  |m1_m2| + |m1 − Domain(m1_m2)| + |m2 − Range(m1_m2)|.

If m1_m2 is total and surjective, or if m_m1 and m_m2 are total, then
mergeCard(m1, m2, m1_m2) = |m1_m2|.
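For finite models, mergeCard can be computed directly from its definition.
The sketch below assumes that models are Python sets of instances and that the
mapping is a set of pairs; the function name merge_card and the sample data
are illustrative only.

```python
def merge_card(m1, m2, m1_m2):
    """|m1_m2| + |m1 - Domain(m1_m2)| + |m2 - Range(m1_m2)| for finite sets."""
    domain = {x for (x, _) in m1_m2}
    range_ = {y for (_, y) in m1_m2}
    return len(m1_m2) + len(m1 - domain) + len(m2 - range_)

# Partial mapping: one matched pair plus one unmatched instance on each side.
partial = merge_card({"x1", "x2"}, {"y1", "y2"}, {("x1", "y1")})

# Total and surjective mapping: the cardinality reduces to |m1_m2|.
full_mapping = {(x, y) for x in ("x1", "x2") for y in ("y1", "y2")}
total = merge_card({"x1", "x2"}, {"y1", "y2"}, full_mapping)
```

In the second call both correction terms vanish, which matches the special
case stated at the end of the theorem.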

Example 4.2.14. Let

m1 = «R1(A), S1(B)»,
m2 = «R2(A), T2(C)»,
m1_m2 = «R1 = R2».

Then,

m = «R(A), S(B), T(C)»,
m_m1 = «R1 = R, S1 = S»,
m_m2 = «R2 = R, T2 = T»

is a valid result of Merge.

Proof: Mappings m_m1 and m_m2 are views on m. Hence, conditions (i)
and (iii) hold. By definition of m1_m2, (x,y) ∈ m1_m2 iff x.R1 = y.R2.
By definition of m_m1 and m_m2: (x,y) ∈ Invert(m_m1) ∘ m_m2 iff there exists
z ∈ m with (z,x) ∈ m_m1 and (z,y) ∈ m_m2 iff there exists z with x.R1 = z.R,
x.S1 = z.S, y.R2 = z.R, y.T2 = z.T iff there exists z with z.R = x.R1 = y.R2,
z.S = x.S1, z.T = y.T2. Such z exists if and only if x.R1 = y.R2. Thus,
condition (ii) is satisfied.

We prove (iv) by Theorem 4.2.3. Notice that m_m1 and m_m2 are total,
so we have to show that |m| = |m1_m2|. Let x ∈ m1. Instance y ∈ m2
participates in m1_m2 iff x.R1 = y.R2 and y.T2 is arbitrary. That is, for
each x ∈ m1, exactly |«T2(C)»| instances of m2 participate in the mapping.
Thus, |m1_m2| = |m1| · |«T2(C)»| = |«R1(A), S1(B)»| · |«T2(C)»| =
|«R1(A), S1(B), T2(C)»| = |m|. Hence, condition (iv) holds. ■

Observe that in the example above, the merged model could also be
expressed as

m' = «R1(A), R2(A), S1(B), T2(C); R1 = R2».

This is a union of the schema signatures of m1 and m2, with a constraint
contained in m1_m2. Models m' and m are equipotent, |m'| = |m|. We generalize
this observation in the following theorem.

Theorem 4.2.4. Let L be a schema language, in which a schema consists of
two parts: a schema signature Sig and a constraint expression C. A signature
Sig is a set of entity definitions {e1, ..., ek}. C is a formula in the constraint
language of L, such as first-order predicate calculus. Let

m1 = «Sig1; C1»,
m2 = «Sig2; C2»

be two schemas in L. Without loss of generality, assume that the entity labels
in Sig1 = {e^1_1, ..., e^1_p} are disjoint from those of Sig2 = {e^2_1, ..., e^2_q}.
Let m1_m2 be a total surjective mapping expressed as a constraint C over Sig1
and Sig2 in the constraint language of L. Then, we can construct a valid
result for Merge by creating a union of schema signatures of m1 and m2 and
a conjunction of constraints C1, C2, C, as

m = «Sig1 ∪ Sig2; C1 ∧ C2 ∧ C»,
m_m1 = «∧_{1≤i≤p} (m1.e^1_i = m.e^1_i)»,
m_m2 = «∧_{1≤i≤q} (m2.e^2_i = m.e^2_i)»

where m_m1 and m_m2 are views on m.

Proof: Analogous to that of Example 4.2.14. ■ 

Example 4.2.15. Let

m1 = «R1(A,B)»,
m2 = «S2(A,C)»,
m1_m2 = «πA(R1) = πA(S2)».

Then,

m = «R(A,B), S(A,C); πA(R) = πA(S)»,
m_m1 = «R1 = R»,
m_m2 = «S2 = S»

is a valid result of Merge by Theorem 4.2.4. ■

Example 4.2.16. Let

m1 = «R[A,B]»,
m2 = «S[A,C]»,
m1_m2 = «SELECT A FROM R = SELECT A FROM S».

Then,

m = «T[A,B,C]»,
m_m1 = «SELECT A, B FROM T»,
m_m2 = «SELECT A, C FROM T»

is a valid result of Merge.

Proof: Conditions (i) and (iii) are satisfied trivially. We now show
condition (ii), that m1_m2 = Invert(m_m1) ∘ m_m2. Obviously, Domain(m1_m2) =
Domain(Invert(m_m1)) = Domain(Invert(m_m1) ∘ m_m2) = m1. Let x ∈ m1,
and let Y1 = {y | (x,y) ∈ m1_m2}, Y2 = {y | (z,x) ∈ m_m1 and (z,y) ∈
m_m2}. Condition (ii) holds if Y1 = Y2. In fact, Y1 describes all database
states of m2 such that for each y ∈ Y1, the S.A column of y equals the R.A
column of x and the S.C column is unconstrained. In contrast, if we traverse
Invert(m_m1) ∘ m_m2 from x, we first obtain all z ∈ m that agree with x on
columns A and B, and then get all y that agree with x on A. Thus, Y1 = Y2 and
condition (ii) holds. Finally, consider the mapping m1_m2. (x,y) ∈ m1_m2
implies that x and y have the same number of rows. For each A-column of
length k, we can construct a list of k B-values and a list of k C-values. Thus,
|m1_m2| = Σ_k |A|^k · |B|^k · |C|^k = Σ_k (|A| · |B| · |C|)^k = |«T[A,B,C]»|.
By Theorem 4.2.3, m, m_m1, m_m2 is a valid result of Merge. ■

4.2.5 Diff Operator 

The operator Diff is complementary to Extract. It takes a model m and a
mapping m_m' between m and some model m', and returns a subordinate model
md of m that does not “participate” in the mapping. To explain the intuition
behind Diff, we continue with the scenario presented in Example 4.2.6.

Example 4.2.17. Let m be the legacy database schema from Example 4.2.6.
The legacy database has been migrated to a new operational database with
schema mx (see Fig. 4.7). Assume that due to data protection regulations,
all data in the legacy database has to be preserved indefinitely. For efficiency,
the legacy data is split between the new operational database and an archival
database. Our goal is to develop a schema md for the archival database
such that md captures only the information needed to reconstruct the legacy
database from the new operational database and the archive, and no other
information. In addition, we need a database transformation m_md that
allows us to populate md with data from m. Together, the transformations
m_md and m_mx describe how the data in the new operational database
relates to the data in the archive. The correspondence mx_md is defined by
composition as mx_md = Invert(m_mx) ∘ m_md. The legacy database can be
reconstructed by merging mx and md under the mapping mx_md. ■

Fig. 4.7. Schematic representation for Example 4.2.17 (Diff)
[label: new schema]

The formal definition of Diff is given below:

Definition 4.2.5 (Diff). Let Domain(m_m') ⊆ m. (md, m_md) =
Diff(m, m_m') holds if and only if the following conditions are satisfied
for some mx, m_mx:

i. (mx, m_mx) = Extract(m, m_m').
ii. (m, m_mx, m_md) = Merge(mx, md, Invert(m_mx) ∘ m_md).
iii. md is a minimal model satisfying (i) and (ii). ■

Operators Extract and Diff are defined in such a way that for a given pair
of instances x ∈ mx and d ∈ md we can reconstruct uniquely the instance
y ∈ m from which x and d were obtained by means of m_mx and m_md. If
we use the same inputs for Extract and Diff, we get what we call a split (see
illustration in Fig. 4.8). Mapping m_m', which splits the model m into mx
and md, is called the wedge mapping of a split.

Corollary 4.2.1 (Split). Let (mx, m_mx) = Extract(m, map), (md, m_md)
= Diff(m, map) and mx_md = Invert(m_mx) ∘ m_md. Then, we can reconstruct
all of m, m_mx, and m_md up to isomorphism from mx, md, and
mx_md. More precisely, (m, m_mx, m_md) = Merge(mx, md, mx_md) holds.

Proof: follows from Definition 4.2.5. ■

Definition 4.2.5 specifies a four-place predicate. Whether it holds or not is
hard to verify formally for fixed md, m_md, m, m_m', since conditions (i) and
(ii) use three complex operators Extract, Merge, and Compose, while condition
(iii) requires examining all possible models md for which there exist mx,
m_mx with (i) and (ii). For instance, consider Fig. 4.8. The figure shows a
valid result of applying the operator Diff to the model m and mapping m_m'
of Fig. 4.4. However, the fact that conditions (i)-(iii) are true is not obvious.




Therefore, before giving detailed examples of Diff, we present
Theorem 4.2.5, which gives an alternative characterization of Diff that can be
verified more easily. The theorem shows how we can reformulate the definition
without considering the results of extraction mx, m_mx explicitly. We
characterize Diff using the definition of ind(., ., .) from Sect. 4.2.3. Among other
things, the theorem states that Diff makes all instances of m that are not
distinguishable under m_m' distinguishable using m_md. Its proof is in
Appendix B.

Theorem 4.2.5 (Simplification of Diff). Let Domain(m_m') ⊆ m.
(md, m_md) = Diff(m, m_m') holds if and only if the following conditions
are satisfied:

1. m_md is a surjective function from m onto md.

2. For all y1, y2 ∈ Domain(m_m') with y1 ≠ y2 and ind(y1, y2, m_m') there
   exist (y1, d1), (y2, d2) ∈ m_md with d1 ≠ d2.

3. If y ∈ m − Domain(m_m'), then there exists (y, d) ∈ m_md and
   {y' | (y', d) ∈ m_md} = {y}.

4. |md| = diffCard(m, m_m'), where diffCard(m, m_m') =df max{|c| : c ∈
   Π ∪ {∅}, |c| ≠ 1} + |m − Domain(m_m')| and Π is a partitioning of
   Domain(m_m') by ind(., ., m_m'). If m_m' is total, diffCard(m, m_m') =
   max{|c| : c ∈ Π ∪ {∅}, |c| ≠ 1}. ■

Condition (2) ensures that the instances of m that are indistinguishable in
m_m' become distinguishable in m_md. Condition (3) requires each instance
of m that does not participate in m_m' to have a counterpart in md that is
not connected to any other instance of m. It ensures that Diff picks up the
instances of m that get lost upon extraction. Condition (4) makes the result
of Diff minimal.
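For a finite model, the cardinality required by condition (4) can be computed
directly. The Python sketch below makes one assumption that is not restated in
this section: two instances are treated as indistinguishable under m_m' when
they are related to exactly the same set of instances of m' (the precise
definition of ind is given in Sect. 4.2.3). The function name diff_card and
the sample mapping are hypothetical.

```python
from collections import defaultdict

def diff_card(m, m_mprime):
    """Largest class of size != 1 (0 if there is none) plus unmapped instances."""
    images = defaultdict(set)
    for y, z in m_mprime:
        images[y].add(z)
    # Group instances by their image set (assumed meaning of ind).
    classes = defaultdict(set)
    for y, zs in images.items():
        classes[frozenset(zs)].add(y)
    sizes = [len(c) for c in classes.values() if len(c) != 1]
    largest = max(sizes, default=0)  # default covers the empty class in Pi ∪ {∅}
    return largest + len(m - set(images))

# Hypothetical mapping: y2 and y3 share an image, y7 does not participate.
m = {"y1", "y2", "y3", "y4", "y5", "y6", "y7"}
m_mprime = {("y1", "z1"), ("y2", "z2"), ("y3", "z2"),
            ("y4", "z3"), ("y5", "z4"), ("y6", "z5")}
card = diff_card(m, m_mprime)
```

With an injective total mapping every class has size one, so the function
returns 0: nothing needs to be archived because no two instances are
confused and none is lost.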

We illustrate the operator Diff using the following examples. 

Example 4.2.18. Consider again the model m and mapping m_m' of Fig. 4.4.
Now, to verify the result, we do not need to construct an auxiliary model
obtained by Extract as we did in Fig. 4.8. We use Theorem 4.2.5 instead
and depict the result of applying the operator Diff directly in Fig. 4.9.
m_md is a surjective function so that condition (1) holds. Instances y2
and y3 are indistinguishable under m_m', therefore y2 and y3 are connected
to two distinct instances of md to satisfy condition (2). Only one instance,
y7, of m is unconnected in m_m', and according to condition (3)
does have a unique image in md. The partitioning Π of Domain(m_m') is
Π = {{y1}, {y2, y3}, {y4}, {y5}, {y6}, {y7}}. The largest equivalence class of
Π, {y2, y3}, has cardinality 2, while |m − Domain(m_m')| = |{y7}| = 1. Thus,
diffCard(m, m_m') = 2 + 1 = 3 = |md|. ■

Example 4.2.19. Fig. 4.10 illustrates that there may be multiple ways of
associating the instances of m with those of md. ■



Fig. 4.9. Example of Diff result by Theorem 4.2.5

Fig. 4.10. The output mapping in Diff is not determined up to isomorphism

Example 4.2.20. Let

m = «R(A), S(B)»,
m' = «T(B)»,
m_m' = «T = S».

Then,

md = «U(A)»,
m_md = «U = R»

is a valid result of Diff.

Proof: Instances of md are obtained by projection, so m_md is a surjective
function and thus condition (1) holds. Notice that ind(y1, y2, m_m') iff y1.S =
y2.S. Assume that y1, y2 ∈ m, y1 ≠ y2, and ind(y1, y2, m_m'). That is, y1.S =
y2.S and y1 ≠ y2, so y1, y2 must differ on R, i.e., y1.R ≠ y2.R. But then, d1 =
y1.R ≠ y2.R = d2, so that condition (2) holds. All instances of m participate
in m_m', so that condition (3) is satisfied trivially. Finally, notice that m_m'
is a total function. By Theorem 4.2.5, diffCard(m, m_m') = max{|c| : c ∈
Π ∪ {∅}, |c| ≠ 1} = max_{z ∈ m'} |{y : y.S = z.T}|. Since for each z ∈ m':
|{y | y.S = z.T}| = |«R(A)»|, we have diffCard(m, m_m') = |«R(A)»|. Hence, md
is minimal since |md| = |«R(A)»|, and condition (4) holds.

To reiterate the intuition behind Diff, md = «U(A)» is a minimal schema
that allows us to reconstruct uniquely an instance y ∈ «R(A), S(B)» from
two instances x ∈ «T(B)» and d ∈ «U(A)» that were previously obtained
from y using the mappings m_mx and m_md produced by the operators
Extract and Diff. ■

Example 4.2.21. Let

m = «R[A,B]»,
m' = «T[A]»,
m_m' = «SELECT A FROM R».

Then,

md = «U[B]»,
m_md = «SELECT B FROM R»

is a valid result of Diff. (Recall that we use the square brackets to indicate
multiset semantics for relational tables.)

Proof: Instances of md are obtained using a SQL query, so m_md is a surjective
function and condition (1) is satisfied. Under multiset semantics, SELECT
A FROM R returns the same number of tuples as the number of tuples in R.
Thus, ind(y1, y2, m_m') iff y1 and y2 agree on the ordered list of A values.
Assume that y1, y2 ∈ m, y1 ≠ y2, and ind(y1, y2, m_m'). Although y1 and y2
agree on A values, we have y1 ≠ y2, so they must differ on the ordered list
of B values. Hence, d1 obtained using SELECT B FROM y1.R differs from d2
obtained using SELECT B FROM y2.R, and condition (2) holds. All instances
of m participate in m_m', so that condition (3) is satisfied trivially.

m_m' is a total function. Hence, by Theorem 4.2.5, diffCard(m, m_m') =
max_{z ∈ m'} |{y : (y,z) ∈ m_m'}|. We have to show that md is equipotent with
{y : (y, zmax) ∈ m_m'}, a maximal equivalence class induced by ind(., ., .).
We get such maximal class when zmax ∈ «T[A]» is an infinite ordered list
of A values. diffCard(m, m_m') is the number of instances of «R[A,B]» that
agree on A with zmax. There are |«R[B]»| such instances (since there are
finitely many finite lists of B values, there is a bijection from the set of
infinite lists to the set of infinite and finite lists). Hence, md is minimal
because |md| = |«R[B]»|, and condition (4) holds. ■

Example 4.2.22. Let

m = «R(ID,A,B)»,
m' = «S(ID,A)»,
m_m' = «S = πID,A(R)».

Then,

md = «T(ID,B)»,
m_md = «T = πID,B(R)»

is an invalid result of Diff. There is a “smaller” schema md' that satisfies the
definition of Diff. We construct md' in the proof.

Proof: md and m_md satisfy conditions (1), (2), (3), but not (4). m_md
is a surjective function onto md since instances of md are obtained by
projection, so condition (1) holds. Notice that ind(y1, y2, m_m') holds iff
πID,A(y1.R) = πID,A(y2.R). Let y1, y2 ∈ m be given with y1 ≠ y2 and
ind(y1, y2, m_m'). That is, y1, y2 agree on the projection of ID and A values.
Since nevertheless y1 ≠ y2, y1 and y2 must differ on B values. Since ID
is a primary key, πID,B extracts all values of column B, including the ones
that differ. Hence, d1 = πID,B(y1.R) ≠ πID,B(y2.R) = d2 and condition (2)
holds. Since m_m' is total, condition (3) is satisfied trivially.

However, the minimality condition (4) is not satisfied. Notice that m_m'
is a total function. By Theorem 4.2.5, diffCard(m, m_m') = max_{z ∈ m'} |{y :
(y,z) ∈ m_m'}|. We obtain the maximal equivalence class cmax of ind(., ., .)
when z = zmax has |ID| tuples, i.e., all ID values are used in zmax. We can
construct all instances y with (y, zmax) ∈ m_m' by adding to zmax a column
with arbitrary B values. There are |B|^|ID| such columns. That is, md must
have k = |B|^|ID| instances to satisfy (4). «T(ID,B)» has, however, more than
k instances, since the length of table T may vary between 0 and |ID|.

To obtain a correct result md' and m_md', we add to md
the constraint that the number of tuples in T equals the domain
size of ID. We get: md' = «T(ID,B); |T| = |ID|», m_md' =
«T = πID,B(R) ∪ {(id, b_fixed) | id ∈ ID, id ∉ πID(R)}», i.e., we extend each
πID,B(y.R) to have |ID| tuples by assigning the value b_fixed ∈ B to each
unused ID value that is not already contained in πID(y.R). ■
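The counting argument in this proof can be replayed on a tiny domain. The
sketch below is only an illustration: it assumes ID = {1, 2}, B = {a, b}, and
that ID acts as a key of T (as it does for any projection of R on ID and B).
It enumerates all instances of «T(ID,B)» and compares their number with
k = |B|^|ID|.

```python
from itertools import product

ID = [1, 2]          # assumed tiny ID domain
B = ["a", "b"]       # assumed tiny B domain

def instances_of_T(ids, bs):
    """All tables T(ID,B) keyed by ID: each id is absent or paired with one B value."""
    tables = set()
    for choice in product([None] + bs, repeat=len(ids)):
        table = frozenset((i, b) for i, b in zip(ids, choice) if b is not None)
        tables.add(table)
    return tables

k = len(B) ** len(ID)                    # required size of md by condition (4)
all_T = instances_of_T(ID, B)            # instances of the unconstrained schema
full_T = {t for t in all_T if len(t) == len(ID)}   # instances with |T| = |ID|
```

Here the unconstrained schema has (|B| + 1)^|ID| = 9 instances, while the
schema restricted by |T| = |ID| has exactly k = 4, which is the effect of the
added constraint in md'.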

Example 4.2.23. Consider the same setting as in Example 4.2.21 but using
set semantics. Let

m = «R(A,B)»,
m' = «T(A)»,
m_m' = «SELECTD A FROM R».

Then,

md = «U(B)»,
m_md = «SELECTD B FROM R»

is an invalid result of Diff.

Proof: m_md violates condition (2). As a counterexample, consider y1 =
{(1,2), (3,4)} and y2 = {(1,4), (3,2)}. Obviously, y1 ≠ y2. However,
ind(y1, y2, m_m') holds, since πA(y1.R) = πA(y2.R) = {1,3}. d1 and d2 with
(y1,d1), (y2,d2) ∈ m_md are determined as d1 = πB(y1.R) = {2,4}, d2 =
πB(y2.R) = {2,4}. That is, d1 = d2 and condition (2) does not hold. In other
words, we cannot reconstruct uniquely an instance y ∈ m from the instances
{1,3} and {2,4}.

If «U(B)» is not a valid output, how else can we then describe the result
of Diff in this example? To do this, we find a maximal equivalence class
cmax of the disjoint decomposition Π induced by ind(., ., .), just as in
Example 4.2.22. cmax is a set of ordered lists of B values whose length lies between
|A| and |A| · |B|. Thus, a schema md' = «T[B]; |A| ≤ |T| ≤ |A| · |B|» could
be a valid schema produced by Diff. The corresponding mapping m_md' is
however difficult to describe using a closed formula. We know that it exists due
to Theorem 4.2.5. Note that schema «U(A,B)» with a bijection «U = R»
satisfies (1), (2), and (3), but cannot be a valid result, since «U(A,B)»
contains too many instances and violates condition (4). ■
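The counterexample from this proof can be replayed in a few lines. The sketch
below models a relation under set semantics as a Python set of tuples; the
helper name project is illustrative.

```python
def project(relation, index):
    """Set-semantics projection of a relation onto one column position."""
    return {row[index] for row in relation}

y1 = {(1, 2), (3, 4)}
y2 = {(1, 4), (3, 2)}

same_A = project(y1, 0) == project(y2, 0)   # images under m_m'
same_B = project(y1, 1) == project(y2, 1)   # images under the candidate m_md
```

Both projections coincide although y1 and y2 differ, so the pair ({1,3}, {2,4})
does not determine the original instance.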

Example 4.2.24. This example illustrates Diff when m_m' contains a join. It
uses the setting of Example 4.2.10. Let

m = «R(A,B), S(B,C)»,
m' = «T(A,B,C)»,
m_m' = «T = R ⋈ S».

Then,

md = «P(A,B), Q(B,C); πB(P) ∩ πB(Q) = ∅»,
m_md = «CREATE VIEW P(A,B) AS
          SELECTD * FROM R WHERE
          B NOT IN (SELECT B FROM S),
        CREATE VIEW Q(B,C) AS
          SELECTD * FROM S WHERE
          B NOT IN (SELECT B FROM R)»

is a valid result of Diff.

Proof: Mapping m_md is defined using a create-view statement, so condition
(1) is satisfied. Two different instances y1, y2 of «R(A,B), S(B,C)» map to
the same instance z of «T(A,B,C)» only if y1 and y2 agree on the subset
of joining B-values and the tuples of R and S that contain these B-values.
Instances y1 and y2 may differ only on those R and S tuples that contain
non-joining B-values. These tuples are exactly those extracted in the above
create-view statement, so that condition (2) holds. m_m' is a view on m,
so that m = Domain(m_m') and condition (3) is satisfied. m_m' is total.
By Theorem 4.2.5, diffCard(m, m_m') = max_{z ∈ m'} |{y : (y,z) ∈ m_m'}|. In
other words, diffCard(m, m_m') is the maximal number of instances y ∈
«R(A,B), S(B,C)» that map to the same fixed instance z of «T(A,B,C)».
This largest set is the set of all databases y in which the join condition is
not satisfied for any tuple, i.e., πB(R) ∩ πB(S) = ∅. All such databases map
to zmax = {∅}. Thus, md is minimal. ■
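The two views of this example can be mimicked with set comprehensions. The
sketch below uses set semantics and hypothetical sample data; the function
names join and diff_views are illustrative. It shows two databases that
produce the same join result but differ in their dangling tuples, which is
exactly what P and Q capture.

```python
def join(R, S):
    """Natural join of R(A,B) and S(B,C) under set semantics."""
    return {(a, b, c) for (a, b) in R for (b2, c) in S if b == b2}

def diff_views(R, S):
    """P and Q keep the tuples whose B value does not occur on the other side."""
    r_bs = {b for (_, b) in R}
    s_bs = {b for (b, _) in S}
    P = {(a, b) for (a, b) in R if b not in s_bs}
    Q = {(b, c) for (b, c) in S if b not in r_bs}
    return P, Q

# Two hypothetical databases with the same join but different dangling tuples.
R1, S1 = {(1, "x"), (2, "y")}, {("x", 10)}
R2, S2 = {(1, "x"), (3, "z")}, {("x", 10)}

T1, T2 = join(R1, S1), join(R2, S2)
P1, Q1 = diff_views(R1, S1)
P2, Q2 = diff_views(R2, S2)

# R and S can be rebuilt from the join result together with P and Q.
R1_rebuilt = {(a, b) for (a, b, _) in T1} | P1
S1_rebuilt = {(b, c) for (_, b, c) in T1} | Q1
```

The join results T1 and T2 are equal, while P1 and P2 differ, so the archive
distinguishes the two databases that the join alone confuses.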

In the above examples we have seen that the definition of Diff is hard to 
satisfy due to the minimality condition (4), which makes seemingly correct 
results invalid. We consider this problem in more detail in Sect. 4.3. In the 
rest of this section, we prove a few important theorems used in the analysis 
of the model-management scenarios that we present. 



Theorem 4.2.6. If (md, m_md) = Diff(m, m_m') and Invert(m_m') is a
surjective function, then md = ∅ and m_md = ∅. ■

The above theorem states that in the case when all of m participates in
m_m', the difference md is “zero”; by Proposition 4.2.6, Extract would have
to pick up all information of m.

Theorem 4.2.7. Let (mx, m_mx) = Extract(m, m_m'), (md, m_md) =
Diff(m, m_m') and mx_md = Invert(m_mx) ∘ m_md. Then, Invert(m_md) ∘
m_m' = Invert(mx_md) ∘ Invert(m_mx) ∘ m_m'.

Proof: By Proposition 4.2.2, Invert(mx_md) = Invert(m_md) ∘ m_mx. Thus,
the right expression of the equality to prove becomes Invert(mx_md) ∘
Invert(m_mx) ∘ m_m' = Invert(m_md) ∘ m_mx ∘ Invert(m_mx) ∘ m_m'.
By Definition 4.2.3, m_mx ∘ Invert(m_mx) ∘ m_m' = m_m'. Therefore,
Invert(mx_md) ∘ Invert(m_mx) ∘ m_m' = Invert(m_md) ∘ m_m'. ■

4.2.6 Confluence Operator 

Let map1 ⊆ m1 × m2 and map2 ⊆ m1 × m2 be two mappings between
models m1 and m2. map1 and map2 could be partial mappings developed
independently by two engineers, or could have been obtained by composition
as e.g. map1 = m1_ma ∘ ma_m2, map2 = m1_mb ∘ mb_m2. The Confluence
operator, ⊕, “unifies” the two mappings:

Definition 4.2.6 (Confluence, ⊕).

map1 ⊕ map2 =df (map1 ∩ map2)
    ∪ {(x,y) ∈ map1 | x ∉ Domain(map2) ∧ y ∉ Range(map2)}
    ∪ {(x,y) ∈ map2 | x ∉ Domain(map1) ∧ y ∉ Range(map1)}

The Confluence operator extracts the “submapping” on which map1 and
map2 agree and adds to it the correspondences between all those instances of
m1 and m2 that participate either only in map1 or only in map2. Obviously,
confluence is commutative.
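On finite mappings, Definition 4.2.6 translates almost literally into set
comprehensions. The following Python sketch assumes mappings are sets of
pairs; the function names and the sample mappings are illustrative only.

```python
def domain(mapping):
    return {x for (x, _) in mapping}

def range_(mapping):
    return {y for (_, y) in mapping}

def confluence(map1, map2):
    """Pairs both mappings agree on, plus pairs whose endpoints only one mapping touches."""
    only1 = {(x, y) for (x, y) in map1
             if x not in domain(map2) and y not in range_(map2)}
    only2 = {(x, y) for (x, y) in map2
             if x not in domain(map1) and y not in range_(map1)}
    return (map1 & map2) | only1 | only2

# The mappings agree on ("a", 1), disagree on "b", and only map2 mentions "c".
map1 = {("a", 1), ("b", 2)}
map2 = {("a", 1), ("b", 3), ("c", 4)}
unified = confluence(map1, map2)
```

The conflicting correspondences for "b" are dropped, the agreed pair is kept,
and the pair ("c", 4), which touches no instance known to map1, is added.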

Theorem 4.2.8. If Range(map1) ⊆ Range(map2), then map1 ⊕ map2 =
map1 ∩ map2.

Proof: follows from Definition 4.2.6. ■

Corollary 4.2.2. If map1 and map2 are both total or both surjective, then
map1 ⊕ map2 = map1 ∩ map2. ■

Example 4.2.25. Let

m1 = «R(A,B), S(B,C)»,
m2 = «T(A,B,C)»,
map1 = «T = R ⋈ S»,
map2 = «πA(R) = πA(T)».

Mappings map1 and map2 are both total, therefore, by Corollary 4.2.2,
map1 ⊕ map2 = map1 ∩ map2. Hence,

map1 ⊕ map2 = «T = R ⋈ S and πA(R) = πA(T)».

Notice that map1 ⊕ map2 is not total. ■

Theorem 4.2.9. If Domain(map1) ∩ Domain(map2) = Range(map1) ∩
Range(map2) = ∅, or equivalently, map1 ∘ Invert(map2) = Invert(map1) ∘
map2 = ∅, then map1 ⊕ map2 = map1 ∪ map2.

Proof: follows from Definition 4.2.6. ■

Theorem 4.2.10. If Invert(map1) is injective or Invert(map2) ∘ map3 = ∅,
then map1 ∘ (map2 ⊕ map3) = (map1 ∘ map2) ⊕ (map1 ∘ map3). ■

In general, however, the distributive law does not hold. To see that, choose
map1 = {(x,y1), (x,y2)}, map2 = {(y1,z1), (y2,z2)}, map3 = {(y2,z2)}.
Then, map1 ∘ (map2 ⊕ map3) = {(x,y1), (x,y2)} ∘ {(y1,z1), (y2,z2)} =
{(x,z1), (x,z2)}. In contrast, (map1 ∘ map2) ⊕ (map1 ∘ map3) =
{(x,z1), (x,z2)} ⊕ {(x,z2)} = {(x,z2)}.
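The failure of the distributive law can be checked by executing the
counterexample. The sketch below restates Confluence and left-to-right
composition for finite sets of pairs; the function names are illustrative.

```python
def compose(f, g):
    """Left-to-right composition: (a, c) whenever (a, b) in f and (b, c) in g."""
    return {(a, c) for (a, b) in f for (b2, c) in g if b == b2}

def confluence(map1, map2):
    """Confluence of two finite mappings, following Definition 4.2.6."""
    dom1, rng1 = {x for (x, _) in map1}, {y for (_, y) in map1}
    dom2, rng2 = {x for (x, _) in map2}, {y for (_, y) in map2}
    only1 = {(x, y) for (x, y) in map1 if x not in dom2 and y not in rng2}
    only2 = {(x, y) for (x, y) in map2 if x not in dom1 and y not in rng1}
    return (map1 & map2) | only1 | only2

map1 = {("x", "y1"), ("x", "y2")}
map2 = {("y1", "z1"), ("y2", "z2")}
map3 = {("y2", "z2")}

left = compose(map1, confluence(map2, map3))
right = confluence(compose(map1, map2), compose(map1, map3))
```

The left-hand side keeps both images of x, whereas the right-hand side keeps
only the pair on which the two composed mappings agree.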

The following theorem illustrates an important use case of the Confluence 
operator. 

Theorem 4.2.11 (Mirror Merge). Let (m, m_m1, m_m2) = Merge(m1, m2,
m1_m2), and m1_n1 and m2_n2 be bijective mappings. Then,
(n, n_n1, n_n2) = Merge(n1, n2, Invert(m1_n1) ∘ m1_m2 ∘ m2_n2) such that
m_n = (m_m1 ∘ m1_n1 ∘ Invert(n_n1)) ⊕ (m_m2 ∘ m2_n2 ∘ Invert(n_n2))
is a bijection. Furthermore, n_n1 = Invert(m_n) ∘ m_m1 ∘ m1_n1 and
n_n2 = Invert(m_n) ∘ m_m2 ∘ m2_n2.

Proof: Models m and n are equipotent since they are obtained from isomorphic
pairs of models. The non-trivial part of the theorem is the construction of
the bijection m_n using Confluence. We sketch the proof in Fig. 4.11. The key
observation is that either the “upper” path over m_m1 ∘ m1_n1 ∘ Invert(n_n1)
or the “lower” path over m_m2 ∘ m2_n2 ∘ Invert(n_n2) allows obtaining a
unique image y ∈ n for each instance x ∈ m. For example, although the
relationship between the instances x1, x2 and y1, y2 in the figure is ambiguous
in the lower path, we can restore it by following the upper path. ■

4.2.7 Match Operator 

Match returns a mapping m1_m2 that describes how the instances of m1
and m2 relate to each other. Often, we can find infinitely many semantically
valid mappings between two models. Each of these mappings makes sense in
a specific application context. Therefore, we do not put any restrictions on
the result of Match other than it is a mapping between m1 and m2:

Definition 4.2.7 (Match). m1_m2 = Match(m1, m2) holds if and only if
m1_m2 ⊆ m1 × m2. ■

Computing the result of Match requires human input, so that it can be
considered a blocking, black-box operator.

Fig. 4.11. Illustration of Theorem 4.2.11 (Mirror Merge)



4.3 Materialization 

To make the formalization presented in this chapter useful for real applicati- 
ons, the results of model-management scripts need to be computed effectively. 
As we explained in Sect. 4.1.4, computing the results of a script amounts to 
finding a variable substitution that satisfies the script. We showed using ex- 
amples that it is often impossible to find the desired variable substitution due 
to the limited expressiveness of concrete schema and mapping languages. In 
this section, we explain how we can extend the scripts in a controlled fashion 
so that computing the results and deploying them in applications becomes 
possible. We refer to this problem as materialization. 

The intuition that we exploit is that it is acceptable to generate more ex- 
pressive models and mappings as long as we can reconstruct the exact results 
if necessary. First, we discuss materialization of models using an example. 

Example 4.3.1. Imagine that a model m produced by a script can be exactly
specified as

m = «R(A,B), S(B,C); πB(R) = πB(S)»

using the classical relational model and the constraint expressed as a
relational algebra expression. We argue that models

«R[A,B], S[B,C]»,

«R[A,B], S[B,C]»,

«T[A,B,C]»,

and some other more expressive models can be used as an adequate
“approximation” of m in SQL DDL. Each of these models dominates m, i.e.,
there is a total surjective function from m' onto m. For example, for m' =
«T[A,B,C]» the function m'_m can be specified as

m'_m = «CREATE VIEW R(A,B) AS
          SELECT DISTINCT A,B FROM T
          WHERE A IS NOT NULL AND B IS NOT NULL
        CREATE VIEW S(B,C) AS
          SELECT DISTINCT B,C FROM T
          WHERE B IS NOT NULL AND C IS NOT NULL»

We can round-trip each instance of m to m' and back without information
loss. For example, we can generate an instance of m' from an instance of m
using the total injective function

map = «CREATE VIEW T(A,B,C) AS
         SELECT A, B, NULL FROM R UNION
         SELECT NULL, B, C FROM S»

It is easy to see that map ∘ m'_m = Id(m). ■
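The round trip of this example can be imitated with Python sets, using None in
place of NULL. This is only a sketch under set semantics with hypothetical
sample data; it assumes that R and S themselves contain no NULL values.

```python
def to_T(R, S):
    """Counterpart of map: pad R tuples and S tuples with None and take the union."""
    return {(a, b, None) for (a, b) in R} | {(None, b, c) for (b, c) in S}

def from_T(T):
    """Counterpart of m'_m: recover R and S from the rows without None in the kept columns."""
    R = {(a, b) for (a, b, _) in T if a is not None and b is not None}
    S = {(b, c) for (_, b, c) in T if b is not None and c is not None}
    return R, S

# Sample instance of m; it satisfies the constraint πB(R) = πB(S) = {"x"}.
R = {(1, "x"), (2, "x")}
S = {("x", 10)}

T = to_T(R, S)
R_back, S_back = from_T(T)
```

Applying to_T and then from_T returns the original pair (R, S), which is the
instance-level reading of map ∘ m'_m = Id(m).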

To materialize m as m' we extend the script that defines m by adding to
it the following conditions:

Invert(m'_m) ∘ m'_m = Id(m); // m'_m is surjective onto m
m' = Domain(m'_m);

The above constraints on m' are quite weak, i.e., there are substantial degrees
of freedom in computing m'. As we demonstrate below, materializing the
mappings in which m participates places additional constraints on m'.

If a model has been obtained from a relational schema in the script, then
casting it into a relational schema may be a reasonable default assumption
(in fact, in Rondo the result of merge is assumed to be a model of the same
type as the input models). Alternatively, the target meta-model could be
specified explicitly by declaring the “type” of the model variable m' as, say,
SQL DDL. We expect that the tuning knobs and implicit policies of the
model-management environment, such as syntactic minimality requirements
on schemas or efficiency (Cosmadakis and Papadimitriou 1984; Spaccapietra
and Parent 1994), can be deployed to drive materialization.

Next, we discuss materialization of mappings. Consider the setting of
Fig. 4.12(a). Assume that model m1 is expressed using a constant, e.g., m1
is a fixed relational schema, while model m2 and mapping map are defined
using a script, say as results of change propagation. Our goal is to materialize
m2 and map. For m2 we can proceed as above: we materialize m2 as a more
expressive model m2', as witnessed by a total surjective function m2'_m2.
Now, mapping map needs to be materialized as a mapping m1_m2' between
m1 and m2'. To be able to reconstruct the original mapping map, we require
that map = m1_m2' ∘ m2'_m2. That is, m2 and map are materialized as m2'
and m1_m2' such that

Invert(m2'_m2) ∘ m2'_m2 = Id(m2);
m2' = Domain(m2'_m2);
map = m1_m2' ∘ m2'_m2;

Fig. 4.12. Materialization of models and mappings [panels: (a), (b)]

To illustrate, consider Example 4.2.22. There, we showed that a quite
intuitive result of Diff was invalid because it violated the minimality
condition. For convenience, we restate the example: let m = «R(ID,A,B)», m' =
«S(ID,A)», and m_m' = «S = πID,A(R)». Then, md = «T(ID,B)», m_md
= «T = πID,B(R)» is an invalid result of Diff. However, we can demonstrate
that this result is a valid materialization of the exact result of Diff.

The materialization constraints suggested above offer substantial degrees
of freedom for choosing m1_m2', but they do not guarantee the existence of
m1_m2' (nor that of m2'). For example, if map is not a function but our target
mapping language is a view definition language, we cannot materialize map
as a view m1_m2', since we know that a composition of two functions must
yield a function.

Just as in the case of schemas, various tuning knobs can be used to steer 
materialization of mappings. One metric is view minimization (Ullman 1997). 
Efficiency is another example: the work in (Shanmugasundaram et al. 2001b; 
Bohannon et al. 2002; Fernandez et al. 2002; Fan et al. 2003) presents various 
algorithms that make the mappings between relational and XML data more 
efficient, while Theodoratos et al. (2001) consider various metrics for selecting 
views in data warehousing, such as query evaluation cost, view maintenance 
cost, or storage space. 

As another example of materialization, consider Fig. 4.12(b). Here, m1 is
not a constant but needs to be computed as well. We materialize m1, m2,
and map respectively as m1', m2', and m1'_m2' such that:

Invert(m1'_m1) ∘ m1'_m1 = Id(m1);
m1' = Domain(m1'_m1);
Invert(m2'_m2) ∘ m2'_m2 = Id(m2);
m2' = Domain(m2'_m2);
map = Invert(m1'_m1) ∘ m1'_m2' ∘ m2'_m2;

In a general case, we are given a set of model and mapping variables with
constraints established by a script. We suggest that a valid materialization
should allow us to reconstruct the exact variable substitutions using a set
of total surjective functions, one for each model, which can be composed
with the materialized mappings to obtain the exact mappings. It is sufficient
to ensure the existence of such total surjective functions; they may not be
expressible in concrete mapping languages.

In an indirect way, materialization lifts constraints on models and
mappings. Since the presence of such constraints, e.g., of integrity constraints,
may be essential for applications, constraint lifting should be done in scripts
explicitly, as we suggested in the above discussion. The dropped constraints
have to be maintained by applications to ensure that the materialized
schemas keep only those database instances that would be representable
in the exact schemas. That is, in Example 4.3.1, the application would
need to maintain the constraint πB(R) = πB(S) for the materialization
«R[A,B], S[B,C]», or the constraint: if A is NULL then B and C NOT
NULL; if C is NULL then A and B NOT NULL, for the materialization
«T[A,B,C]». An approach to automating application-based constraint
management is discussed in (Peckham et al. 1995).

We expect that there may be more than one approach to materializing 
schemas and mappings. For instance, in (Fagin et al. 2003) the authors ex- 
amine the problem of data exchange, which can be viewed as a problem of 
materializing a non-functional mapping between two models as a view, i.e., 
as a function. In this case, the exact mapping could be reconstructed using 
a different kind of composition, which allows obtaining the so-called certain 
answers for queries. In general, materialization can be stated as a constraint 
satisfaction problem, as exemplified in Sect. 10.4. 

Materialization of schemas and mappings is associated with information 
loss, since the total surjective functions utilized for materialization may them- 
selves not be expressible in concrete mapping languages. Therefore, materia- 
lization should only be done for models and mappings that need to be de- 
ployed by applications, but not for the intermediate results of scripts. As a 
matter of fact, in (Buneman et al. 1992) the authors found that materializing 
the intermediate merge results leads to information loss due to the limited 
expressiveness of the schema language they used, so that merging becomes a 
non-associative operation. To fix the problem, they introduce a more expres- 
sive auxiliary language and materialize the schemas only at the very last step. 

In general, it may be beneficial to preserve the original models and map- 
pings, and the scripts used to obtain the intermediate materialized result. 
In this way, the exact results of scripts are available for later use in other 
scripts. In addition, keeping the original inputs and scripts facilitates migra- 
tion to new standards and data management systems. For example, should 
we at some point of time decide to use a more expressive language for our 
schemas, e.g., XML Schema instead of SQL DDL, and migrate our data to 
a new DBMS, we may be able to compute “tighter” models for our results, 
i.e., the schemas may be materialized more accurately, with extra integrity 
constraints that were not expressible earlier. 


“Nothing endures but change.”

- Heraclitus (540-480 BC) 



In this chapter, we revisit the change propagation scenario. We present a solu- 
tion for this scenario using the operators of Chap. 4. We argue the correctness 
of our solution by examining several special cases and by showing that the 
scripts that we developed match the intuition in these special cases. We can- 
not formally prove that our solution is correct. To do so, we would need a 
formal specification of what change propagation means. Instead, we argue
that the script that we present provides a major part of such a specification
in the first place. This specification can be used to drive and verify
implementations for concrete schema and mapping languages. In fact, in Chap. 6 we
examine to what extent the implementation presented in Chap. 2 conforms 
to the specification of change propagation presented below. 

A general outline of the change propagation scenario is the following (see
Sect. 2.1 for a more detailed discussion). Assume that we are given models
$s_1$ and $d_1$ and a mapping $s_1\_d_1$ between them. Now, $s_1$ changes into $s_2$. The
changes are described by the mapping $s_1\_s_2$, which may have been obtained
by matching $s_1$ and $s_2$. We want these changes to be propagated to $d_1$, i.e.,
we look for a model $d_2$ and a new mapping $s_2\_d_2$ that describes how the new
model $d_2$ relates to $s_2$.

We begin with a simple variation of the change propagation scenario and 
work our way towards a general case. First, we consider additions only in 
Sect. 5.1. Then, we consider deletions in Sect. 5.2, present a general solu- 
tion in Sect. 5.3 and examine schema evolution as a special case of change 
propagation in Sect. 5.4. We conclude the chapter and discuss several other 
possible variations of the change propagation scenario in Sect. 5.5. 

Before we dive into the scenario, one note is due. What we call “addition” 
corresponds to an abstract model modification that extends the set of possi- 
ble instances, i.e., it adds information capacity to the model. Addition may 
be caused by adding new elements to the model or by dropping certain existing
constraints. Similarly, “deletion” is another abstract manipulation that
reduces the set of possible instances, i.e., produces a less expressive model.
Deletion may be caused by removing certain model elements or by adding 
constraints to the model. 



5.1 Propagating Additions 

Consider the schematic representation in Fig. 5.1. We continue using the 
convention that the names of the mapping variables identify the left and 
right models participating in the mappings. Thus, $s_1\_s_2 \subseteq s_1 \times s_2$.
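
To make the naming convention concrete, a mapping can be pictured as a finite set of instance pairs. The sketch below is a toy model under that assumption: instances are labelled by strings, and `invert` and `compose` are hypothetical helpers that mimic Invert and left-to-right composition on such finite relations. It is not the implementation used by any system discussed here.

```python
def invert(mapping):
    """Invert a mapping given as a set of (left, right) instance pairs."""
    return {(right, left) for (left, right) in mapping}

def compose(first, second):
    """Left-to-right composition: pairs (x, z) linked through a shared y."""
    return {(x, z) for (x, y) in first for (y2, z) in second if y == y2}

# Toy instances are labelled by strings (purely illustrative).
s1_s2 = {("i1", "j1"), ("i2", "j2")}   # a subset of s1 x s2
s1_d1 = {("i1", "k1"), ("i2", "k2")}   # a subset of s1 x d1

# Invert(s1_d1) o s1_s2 relates instances of d1 to instances of s2.
d1_s2 = compose(invert(s1_d1), s1_s2)
```

Expressions of this shape, an inverted mapping composed with another mapping, recur throughout the scripts of this chapter.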

Assume that $s_1$ is transformed into a more expressive model $s_2$, such that
$\mathrm{Invert}(s_1\_s_2)$ is a view on $s_2$. The direction of the arrow labeled $s_1\_s_2$ in the
figure indicates that the instances of $s_1$ are functionally determined by the
instances of $s_2$. To propagate additions from $s_1$ to $d_1$ means to construct a
model $d_2$ that can express all information of $d_1$ plus all extra information of
$s_2$. The extra information of $s_2$, which is not captured by $s_1$, can be obtained
using the operator Diff. The resulting model is then merged with the model
$d_1$. This approach is described in the following script:




Fig. 5.1. Propagating additions 



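1. $(s_2', s_2\_s_2') = \mathrm{Diff}(s_2, \mathrm{Invert}(s_1\_s_2))$;

2. $d_1\_s_2' = \mathrm{Invert}(s_1\_d_1) \circ s_1\_s_2 \circ s_2\_s_2'$;

3. $(d_2, d_2\_d_1, d_2\_s_2') = \mathrm{Merge}(d_1, s_2', d_1\_s_2')$;

4. $s_2\_d_2 = (\mathrm{Invert}(s_1\_s_2) \circ s_1\_d_1 \circ \mathrm{Invert}(d_2\_d_1)) \oplus
   (s_2\_s_2' \circ \mathrm{Invert}(d_2\_s_2'))$;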

To argue the correctness of the above solution, we consider the following 
special case: 

Proposition 5.1.1. If $\mathrm{Invert}(s_1\_s_2)$ is a surjective function and $s_1\_d_1$ is a
bijection, then $s_2\_d_2$ is a bijection too. That is, if $s_1$ and $d_1$ are equipotent
and we add a certain amount of information to $s_1$, then our script ensures
that the same amount of information is added to $d_1$ to obtain $d_2$.

Proof: The idea of the proof is to show that merging $s_2'$ and $d_1$ produces a
result equivalent to merging $s_2'$ and the model $m$ obtained by extraction
from $s_2$. The auxiliary mappings and the model $m$ that are used in the proof are
represented using dotted lines and rectangles in Fig. 5.1 and are highlighted
below in bold; all other variables are defined in the script or are input
variables.

To construct the alternative merge, we proceed as follows. Let
$(m, s_2\_m) = \mathrm{Extract}(s_2, \mathrm{Invert}(s_1\_s_2))$ and $m\_s_2' = \mathrm{Invert}(s_2\_m) \circ s_2\_s_2'$. By
Corollary 4.2.1 we get $(s_2, s_2\_m, s_2\_s_2') = \mathrm{Merge}(m, s_2', m\_s_2')$.

Let $s_1\_m = s_1\_s_2 \circ s_2\_m$. By Proposition 4.2.7, $s_1\_m$ is a bijection.
Therefore, $m\_d_1 = \mathrm{Invert}(s_1\_m) \circ s_1\_d_1$ is also a bijection. Thus, we have a
Merge of $m$ and $s_2'$ and a Merge of $d_1$ and $s_2'$, where $m$ and $d_1$ are equipotent.
By Definition 4.2.3, $\mathrm{Invert}(s_1\_s_2) = s_2\_m \circ \mathrm{Invert}(s_2\_m) \circ \mathrm{Invert}(s_1\_s_2) =
s_2\_m \circ \mathrm{Invert}(s_1\_m)$. Hence, by Proposition 4.2.2, $s_1\_s_2 = s_1\_m \circ \mathrm{Invert}(s_2\_m)$.
Notice that $(s_1\_m \circ \mathrm{Invert}(s_2\_m)) \circ s_2\_s_2' = s_1\_m \circ m\_s_2'$. Therefore,
$s_1\_s_2 \circ s_2\_s_2' = s_1\_m \circ m\_s_2'$, and we obtain that $d_1\_s_2' = \mathrm{Invert}(m\_d_1) \circ m\_s_2'$.

Now, we are ready to apply Theorem 4.2.11, which entails that $d_2$ is
isomorphic to $s_2$ with a bijection
$(s_2\_m \circ m\_d_1 \circ \mathrm{Invert}(d_2\_d_1)) \oplus (s_2\_s_2' \circ \mathrm{Invert}(d_2\_s_2'))$
between them.

The second part of the above expression is equivalent to that of the last
line of the script, line 4, which defines $s_2\_d_2$. To obtain the proposition,
all we need to show is that $s_2\_m \circ m\_d_1 = \mathrm{Invert}(s_1\_s_2) \circ s_1\_d_1$. We have
demonstrated above that $\mathrm{Invert}(s_1\_s_2) = s_2\_m \circ \mathrm{Invert}(s_1\_m)$. By composing
both parts of the expression with $s_1\_d_1$, we obtain $\mathrm{Invert}(s_1\_s_2) \circ s_1\_d_1 =
s_2\_m \circ \mathrm{Invert}(s_1\_m) \circ s_1\_d_1 = s_2\_m \circ m\_d_1$. ■

Analogously, we could show that if $s_1\_d_1$ is a view on $s_1$, then $s_2\_d_2$ is a
view on $s_2$. For that, we would need an extension of Theorem 4.2.11, which
uses surjective functions instead of bijections. 
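
The properties used in the proposition (function, surjective function, bijection) can be tested mechanically on small finite mappings. The sketch below assumes the usual set-theoretic reading of these terms over sets of instance pairs; Chap. 4 remains the authoritative source for the definitions, and all helper names are hypothetical.

```python
def is_function(mapping):
    """Each left instance is related to at most one right instance."""
    images = {}
    for left, right in mapping:
        if images.setdefault(left, right) != right:
            return False
    return True

def is_total(mapping, left_instances):
    """Every left instance participates in the mapping."""
    return set(left_instances) <= {left for left, _ in mapping}

def is_surjective(mapping, right_instances):
    """Every right instance participates in the mapping."""
    return set(right_instances) <= {right for _, right in mapping}

def is_bijection(mapping, left_instances, right_instances):
    """Total, surjective, and functional in both directions."""
    inverse = {(right, left) for left, right in mapping}
    return (is_function(mapping) and is_total(mapping, left_instances)
            and is_surjective(mapping, right_instances)
            and is_function(inverse))
```

With such checks, the premises of Proposition 5.1.1 can be verified on toy inputs before a script is applied to them.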



5.2 Propagating Deletions 

Consider the scenario of Fig. 5.2. Assume that $s_1$ is transformed into a less
expressive model $s_2$, such that $s_1\_s_2$ is a view on $s_1$. Propagating deletion
from $s_1$ to $d_1$ means that we discard all instances of $d_1$ that are not relevant
for representing the information in $s_2$. In other words, we keep the instances
of $d_1$ that are relevant for $s_2$ and those that do not participate in $s_1\_d_1$ in
the first place, since the latter are not affected by the change. This effect can
be achieved using the following script:

1. $d_1\_s_2 = \mathrm{Invert}(s_1\_d_1) \circ s_1\_s_2$;

2. $(m, d_1\_m) = \mathrm{Extract}(d_1, d_1\_s_2)$; // still in $s_2$

3. $(n, d_1\_n) = \mathrm{Diff}(d_1, \mathrm{Invert}(s_1\_d_1))$; // to keep in $d_1$

4. $(d_2, d_2\_m, d_2\_n) = \mathrm{Merge}(m, n, \mathrm{Invert}(d_1\_m) \circ d_1\_n)$;

5. $d_1\_d_2 = (d_1\_m \circ \mathrm{Invert}(d_2\_m)) \oplus (d_1\_n \circ \mathrm{Invert}(d_2\_n))$;

6. $s_2\_d_2 = \mathrm{Invert}(d_1\_s_2) \circ d_1\_d_2$;

Fig. 5.2. Propagating deletions



In line 2, we extract the model $m$, which captures all information of $s_2$
“visible” through $s_1$. In line 3, we determine the portion $n$ of $d_1$ that does not
participate in $s_1\_d_1$, using the operator Diff. Model $n$ represents the
information that needs to be preserved in $d_2$ no matter what the mapping $s_1\_s_2$
looks like.

To substantiate the correctness claim for the above solution, we examine 
the following two special cases. 

Proposition 5.2.1. If $s_1\_s_2$ is a bijection, then $d_1\_d_2$ is a bijection and
$s_2\_d_2$ is isomorphic to $s_1\_d_1$. That is, if the information capacity of $s_1$
remains unchanged, so does that of $d_1$.

Proof: Since $s_1\_s_2$ is a bijection, $d_1\_s_2$ is isomorphic to $\mathrm{Invert}(s_1\_d_1)$.
Hence, Extract and Diff in lines 2-3 form a split (see Corollary 4.2.1). By
Theorem 4.2.11, $d_1\_d_2$ is a bijection. Therefore, by composition in line 6,
$s_2\_d_2$ is isomorphic to $s_1\_d_1$. ■

Proposition 5.2.2. If $s_1\_s_2$ is a surjective function and $s_1\_d_1$ is a bijection,
then $s_2\_d_2$ is a bijection. That is, if $s_1$ and $d_1$ are equipotent and we delete a
certain amount of information from $s_1$, then our script ensures that the same
amount of information is deleted from $d_1$ to obtain $d_2$.

Proof: Since $s_1\_d_1$ is a bijection, it is also a surjective function, and by
Theorem 4.2.6, $n = \emptyset$ and $d_1\_n = \emptyset$. Consequently, $m = d_2$ and the above
script is equivalent to the following script (with respect to $d_2$, $s_2\_d_2$):

$d_1\_s_2 = \mathrm{Invert}(s_1\_d_1) \circ s_1\_s_2$;

$(d_2, d_1\_d_2) = \mathrm{Extract}(d_1, d_1\_s_2)$;

$s_2\_d_2 = \mathrm{Invert}(d_1\_s_2) \circ d_1\_d_2$;

This script is shown schematically in Fig. 5.3. 




Fig. 5.3. Propagating deletions over bijection 
By construction, $d_1\_s_2$ is a surjective function. By Proposition 4.2.6, $s_2\_d_2$
is a bijection. ■

Notice however that if $s_1\_d_1$ is a view on $s_1$ then the resulting mapping
$s_2\_d_2$ need not be a view on $s_2$. At first glance, this result does not seem to
match our intuition. As we illustrate in the following example, the reason is
that deletion is not limited to removing attributes from $s_1$, but could have a
variety of other causes, such as restricting the domain of an attribute using 
an arithmetic function. 

Example 5.2.1. Let

$s_1$ = «R[val : byte]»,

$d_1$ = «S[val : byte(0..128)]»,

$s_1\_d_1$ = «SELECT val / 2 FROM R»

Assume integer division and multiset semantics for schemas. Let the deletion
be specified by the mapping

$s_1\_s_2$ = «SELECT val / 3 FROM R».

$s_1\_d_1$ is a surjective function, so the propagation of deletion is equivalent to
the 3-line script used in the proof above. The mapping $d_1\_s_2$ obtained by
composition can be specified as follows: $(y, x) \in d_1\_s_2$ iff $x$ and $y$ have the
same number of tuples and for each $i$-th tuple $t_x$ of $x$ and $t_y$ of $y$ the following
condition holds: ($t_y.\mathrm{val} = t_x.\mathrm{val} \cdot 2/3$ or $t_y.\mathrm{val} = (t_x.\mathrm{val} \cdot 2 + 1)/3$). The
extracted schema $d_2$ is obtained as $d_2$ = «T(val : byte[0..128])», where $d_1\_d_2$
is a bijection. Thus, the mapping $s_2\_d_2$ is isomorphic to $\mathrm{Invert}(d_1\_s_2)$, i.e.,
can be expressed using a similar condition as above. $s_2\_d_2$ is not a function.
In fact, the instance $x = \{4\} \in s_2$ maps by way of $s_2\_d_2$ to two instances of
$d_2$, $y_1 = \{2\}$ and $y_2 = \{3\}$, since $4 \cdot 2/3 = 2$ and $(4 \cdot 2 + 1)/3 = 3$. ■
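
The one-to-many behaviour that arises when two integer-division views of the same column are related can be reproduced by brute force. The sketch assumes that a byte ranges over 0..255 and that “/” is integer division; `related_values` is a hypothetical helper. For a value of one derived column it collects every value of the other derived column that some byte `r` produces alongside it.

```python
def related_values(value, from_divisor, to_divisor, domain=range(256)):
    """Values r // to_divisor that co-occur with r // from_divisor == value."""
    return sorted({r // to_divisor
                   for r in domain
                   if r // from_divisor == value})

# Halves related to thirds: the value 4 is compatible with two results,
# mirroring the arithmetic 4*2/3 = 2 and (4*2+1)/3 = 3.
two_results = related_values(4, 2, 3)
```

Whenever the returned list has more than one element, the relation between the two derived columns is not a function in that direction, so no view can be generated automatically.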

Observe that the “deletion” done in the example is somewhat unorthodox. 
One can show that in simpler cases, in which $s_2$ is obtained by removing one
or more attributes from $s_1$, the result $s_2\_d_2$ remains a view.

The example illustrates that certain changes of the source schema may 
break the view in such a way that it cannot be “repaired” fully automatically. 
Intervention of a human designer may be necessary to define an updated view, 
which otherwise becomes a non-functional transformation. 



5.3 A General Solution 

In a general solution for change propagation, which is depicted schematically 
in Fig. 5.4, we propagate the additions and deletions simultaneously. We 
combine the scripts of Sections 5.1 and 5.2 and obtain the following solution: 

1. $d_1\_s_2 = \mathrm{Invert}(s_1\_d_1) \circ s_1\_s_2$;

2. $(m, d_1\_m) = \mathrm{Extract}(d_1, d_1\_s_2)$; // still in $s_2$

3. $(n, d_1\_n) = \mathrm{Diff}(d_1, \mathrm{Invert}(s_1\_d_1))$; // to keep in $d_1$

4. $(D, D\_m, D\_n) = \mathrm{Merge}(m, n, \mathrm{Invert}(d_1\_m) \circ d_1\_n)$;

5. $d_1\_D = (d_1\_m \circ \mathrm{Invert}(D\_m)) \oplus (d_1\_n \circ \mathrm{Invert}(D\_n))$;

6. $s_2\_D = \mathrm{Invert}(d_1\_s_2) \circ d_1\_D$;

7. $(s_2', s_2\_s_2') = \mathrm{Diff}(s_2, \mathrm{Invert}(s_1\_s_2))$;

8. $D\_s_2' = \mathrm{Invert}(d_1\_D) \circ d_1\_s_2 \circ s_2\_s_2'$;

9. $(d_2, d_2\_D, d_2\_s_2') = \mathrm{Merge}(D, s_2', D\_s_2')$;

10. $s_2\_d_2 = (\mathrm{Invert}(d_1\_s_2) \circ d_1\_D \circ \mathrm{Invert}(d_2\_D)) \oplus
    (s_2\_s_2' \circ \mathrm{Invert}(d_2\_s_2'))$;

Fig. 5.4. Change propagation: a general solution

The lines 1-6 correspond to the deletion script of Sect. 5.2, with the only
difference that $d_2$ is replaced by $D$. The remaining lines 7-10 deal with
propagating additions. Line 8 can be simplified as $D\_s_2' = \mathrm{Invert}(s_2\_D) \circ s_2\_s_2'$
by exploiting the result computed in line 6.

By construction, the above script is equivalent to either the deletion or
addition script if $s_1\_s_2$ or $\mathrm{Invert}(s_1\_s_2)$ is a surjective function, respectively.
In fact, if $s_1\_s_2$ is a surjective function, then $s_2' = \emptyset$ and the script can
be rewritten as the deletion script of Sect. 5.2. If $\mathrm{Invert}(s_1\_s_2)$ is a surjective
function, then $\mathrm{Extract}(d_1, \mathrm{Invert}(s_1\_d_1) \circ s_1\_s_2)$ yields the same result as
$\mathrm{Extract}(d_1, \mathrm{Invert}(s_1\_d_1))$. Hence, $m$ and $n$ form a split over the wedge
morphism $\mathrm{Invert}(s_1\_d_1)$, and $d_1 = D$ by Corollary 4.2.1. Thus, in this case the
script can be rewritten as the addition script of Sect. 5.1.

We suggest that the above script describes the intended state-based se- 
mantics of change propagation. To support this claim, we have shown that 
its semantics matches the intuition in the special cases discussed above. 



5.4 Schema Evolution Scenario 

The schema evolution problem arises when a change to a database schema 
breaks a view that is defined on it. The schema evolution scenario is a special 
case of the change propagation scenario of Sect. 5.3, when $s_1\_d_1$ is a total
surjective function. In this case, by Theorem 4.2.6, $\mathrm{Diff}(d_1, \mathrm{Invert}(s_1\_d_1))$
yields an empty model $n$, and the model $D$ in Fig. 5.4 is equipotent with $m$.
Thus, the solution of Sect. 5.3 can be simplified as illustrated in Fig. 5.5. The 
respective script is shown below: 




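Fig. 5.5. Schema evolution: a special case of change propagation

$d_1\_s_2 = \mathrm{Invert}(s_1\_d_1) \circ s_1\_s_2$;

$(m, d_1\_m) = \mathrm{Extract}(d_1, d_1\_s_2)$;

$(s_2', s_2\_s_2') = \mathrm{Diff}(s_2, \mathrm{Invert}(s_1\_s_2))$;

$s_2'\_m = \mathrm{Invert}(s_2\_s_2') \circ \mathrm{Invert}(d_1\_s_2) \circ d_1\_m$;

$(d_2, d_2\_s_2', d_2\_m) = \mathrm{Merge}(s_2', m, s_2'\_m)$;

$s_2\_d_2 = (\mathrm{Invert}(d_1\_s_2) \circ d_1\_m \circ \mathrm{Invert}(d_2\_m)) \oplus (s_2\_s_2' \circ \mathrm{Invert}(d_2\_s_2'))$;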

The resulting mapping $s_2\_d_2$ corresponds to the updated view definition,
whereas $d_2$ is the updated schema. As we demonstrated in Sect. 5.2, $s_2\_d_2$
may not be a function, i.e., certain changes to the database schema may make 
it impossible to define an updated view on the new schema without human 
decision-making. 

In many schema evolution scenarios involving SQL views, it may be desirable
to use the above script even if $s_1\_d_1$ is a total function, but not surjective.
In this case, model n is not empty, but can be safely ignored. This approach 
is justified by the fact that the SQL view definition language allows the view 
schema to have instances that are impossible to obtain by executing the view 
query over the source database. For example, consider the view defined as 
CREATE VIEW V(age: int) AS SELECT S.age FROM S WHERE S.age < 20. This view is not
surjective onto the view schema «V(age : int)», since the view schema fails 
to capture the constraint that age values never exceed the value 20. However, 
due to this constraint the difference schema n = «V(age : int, age > 20)» 
has no instances that can be obtained by executing the view definition and 
can be safely ignored in the result. 
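
The argument can be checked on a toy domain. The sketch below reads the view as a selection of the ages below 20, which is an assumption about the intended query; the domain `range(0, 120)` and the name `view_v` are likewise illustrative only.

```python
def view_v(source_ages):
    """Ages returned by the view, read as: SELECT age FROM S WHERE age < 20."""
    return [age for age in source_ages if age < 20]

candidate_ages = range(0, 120)                        # illustrative domain
reachable = set(view_v(candidate_ages))               # ages the view can output
in_n = {age for age in candidate_ages if age > 20}    # ages admitted by n
```

Because `reachable` and `in_n` share no value, no tuple admitted by the difference schema can ever appear in a result of the view, which is why $n$ can be dropped.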



5.5 Variants of Change Propagation

Change propagation is a very rich scenario. In this section, we highlight se- 
veral aspects of it that need to be considered in future work, such as conver- 
sion, splitting, batching, chaining, etc. 

The state-based approach and the solution for change propagation that
we presented above abstract from the fact that the models $s_1$, $s_2$, and $d_1$
may be expressed in different schema languages. For example, line 9 of the
script of Sect. 5.3 may describe a merge of a relational and an XML schema. 
In Chap. 2, we assumed that all operators return their results expressed 
in the same schema language as the input schemas. Therefore, an explicit 
conversion step is required in Rondo before we can merge a relational schema 
with an XML schema. In (Bernstein 2003) , a special operator called ModelGen 
was introduced to implement conversion. Ideally, conversion should return an 
equipotent model and a bijection between the input and the output model. 
In this case, conversion is a “transparent” operation in terms of state-based 
semantics and can be introduced at any step of a model-management script 
without changing its semantics. However, in many cases conversion is bound 
to yield a strictly more expressive model due to the limited expressiveness of 
the target schema language. In such cases, the state-based semantics of the 
script may differ depending on where the conversion step is introduced into 
the script. To illustrate, consider the following example. 

Example 5.5.1. Consider propagation of additions illustrated in Fig. 5.1. If 
we assume that conversion yields equipotent models, then the solutions in 
Fig. 5.6 and in Fig. 5.7 are equivalent to that shown in Fig. 5.1. In Fig. 5.6, we
convert $s_2$ to $c$ before applying the Diff operators, just as we did in Sect. 2.1
(see Figures 2.2 and 2.3), using the expression $(c, s_2\_c) = \mathrm{ModelGen}(s_2)$.
In Fig. 5.7, we first apply the Diff operator and then convert,
$(c', s_2'\_c') = \mathrm{ModelGen}(s_2')$. If the mappings $s_2\_c$ and $s_2'\_c'$ are not bijections,
the semantics of all three solutions may differ from each other. ■




Fig. 5.6. Addition only, convert first, then Diff

Fig. 5.7. Addition only, Diff first, then convert



The script of Sect. 5.3 presents one possible solution for change propaga- 
tion. No doubt, many other scripts may achieve the same effect as the one 
that we developed. However, due to the script complexity, it may be quite 
hard to determine whether two or more alternative solutions are equivalent. 
For example, in Fig. 5.4 we effectively specify a 3-way Merge between the
models $m$, $n$, and $s_2'$. We do so by first merging $m$ and $n$, and then merging
the result with $s_2'$. What if we first merge $s_2'$ and $m$?

In the script of Sect. 5.3 we assumed that all information added to $s_1$
by means of $s_2$ is new and is not covered by $d_1$. Thus, we simply merge
$D$ and $s_2'$ using the mapping $D\_s_2'$ obtained by composition. However, in the
most general case, $D$ and $s_2'$ may have a greater overlap than that suggested
by $D\_s_2'$. This can happen when the information added to $s_1$ by way of $s_2$
is already contained in $d_1$. To determine that, we would need to match the
portions of $D$ and $s_2'$ that do not participate in $D\_s_2'$. These portions can be
obtained using the operator Diff.

Another question of great practical importance is whether we can propa- 
gate the changes in small fragments rather than all at once, and still obtain 
an equivalent solution. In fact, some change management tools face the choice 
of either propagating each atomic modification one by one or as a set of batch 
operations of larger granularity. Splitting or batching the modifications may 
be necessary for efficiency reasons, for example, if $d_1$ is a model of a code
base of several million lines that needs to be updated. A related question 
is whether chaining of change propagation operations can be simplified. The 
chaining takes place when the changes propagated from one model to another 
trigger change propagation to a third, fourth, etc. model. 

We conclude this chapter by making the following remarks: 

— Change propagation is a complex scenario that we are only starting to 
understand. 

— Conversion needs to be modeled explicitly if it produces more expressive 
models. 

— An automated theorem prover could be instrumental in analyzing the equi- 
valence or subsumption of alternative solutions, or more elaborate change 
propagation scenarios. 


“Although nature commences with reason and ends in experience
it is necessary for us to do the opposite, that is to commence with 
experience and from this to proceed to investigate the reason.” 

- Leonardo da Vinci (1452-1519) 



In this chapter, we discuss the relationship between the structural opera- 
tor definitions given in Chap. 2 and the state-based definitions presented 
in Chap. 4. To avoid ambiguity, we refer to the operators and scripts used 
in Rondo (Chap. 2) as structural and to those of Chap. 4 as state-based. 
The discussion that we present illustrates how the behavior of Rondo and 
other complex metadata management systems can be analyzed in terms of 
state-based semantics. 

In Rondo, we use subsets of the standard schema languages such as SQL 
DDL or XML Schema. The state-based semantics for these languages is well- 
known. The definitions in Chap. 2 give a precise specification of the output 
schemas and morphisms returned by the structural operators. Hence, to de- 
fine the state-based semantics of the structural operators, we merely have to 
provide a formal specification of the state-based semantics for morphisms and 
selectors. Once we have the state-based semantics for models, morphisms, and 
selectors, we effectively obtain a precise state-based semantics for the struc- 
tural operators. Consequently, the state-based semantics of scripts containing 
a mix of state-based and structural operators becomes well-defined. We will 
use this observation in Sect. 6.3 to relate the structural operators to the 
state-based ones. 



6.1 Semantics of Morphisms 

Morphisms are a graphical language. The arcs that connect individual schema 
elements constitute the elementary expressions of this language. Although 
this language has been used in various tools, to our knowledge no previous 

S. Melnik: Generic Model Management, LNCS 2967, pp. 101-113, 2004. 
work defined its semantics precisely. Moreover, it is likely that the semantics
differs from tool to tool, i.e., there is no best definition that suits each use case. 
In this section, we discuss several alternatives for the meaning of morphisms 
and select one of them as our working interpretation. We focus on morphisms 
connecting the attributes of simple relational schemas under multiset (i.e., 
SQL) semantics without integrity or key constraints. In the case of complex 
schemas, such as relational schemas with constraints or XML Schema with 
nested types, the behavior of Rondo scripts is hard to characterize using the 
state-based operators. The reason is that the structural operators have been 
developed prior to the state-based semantics of morphisms that is presented 
in this section. Hence, the properties of the structural operators do not match 
exactly the requirements stated in Chap. 4. We illustrate this point in more 
detail in Sect. 6.3. 

Consider Fig. 6.1. It shows a simple morphism that connects the schemas
$m_1$ and $m_2$ using three arcs. Observe that the schema $m_2$ appears to be a
normalized representation of $m_1$. The foreign key dependency between the tables
S and T in $m_2$ is intentionally omitted, i.e., we do not know whether S.ID is
a foreign key for T.ID or the other way around, or whether S.ID and T.ID
are keys at all. We examine three interpretations of this morphism, denoted
as $map_1$, $map_2$, and $map_3$. To give them names, we call these interpretations
the tuple-list, tuple-set, and value-set interpretation, respectively.




$map_1 \subseteq map_2 \subseteq map_3$

$map_1$ (tuple-list): $\pi_{Name}[R] = \pi_{Name}[S]$, $\pi_{City,Zip}[R] = \pi_{City,Zip}[T]$

$map_2$ (tuple-set): $\pi_{Name}(R) = \pi_{Name}(S)$, $\pi_{City,Zip}(R) = \pi_{City,Zip}(T)$

$map_3$ (value-set): $\pi_{Name}(R) = \pi_{Name}(S)$, $\pi_{City}(R) = \pi_{City}(T)$, $\pi_{Zip}(R) = \pi_{Zip}(T)$

Fig. 6.1. Three alternative semantics for a morphism



The tuple-list interpretation $map_1$ is motivated by version management.
Assume that one of the schemas is a new version of the other
schema. In this case, an instance of the new version of the schema can
be obtained by copying the values from the instance of the old schema.
Each arc indicates what attribute values are copied. For example, the
arc connecting R.Name and S.Name indicates that the attribute values
of R.Name are selected and inserted into S.Name, or vice versa. This
relationship can be expressed as a constraint SELECT Name FROM R =
SELECT Name FROM S, which we abbreviate as $\pi_{Name}[R] = \pi_{Name}[S]$.
Similarly, for the attributes City and Zip we obtain the constraints
$\pi_{City}[R] = \pi_{City}[T]$ and $\pi_{Zip}[R] = \pi_{Zip}[T]$, respectively. Thus, our first
interpretation of the morphism yields a mapping that can be described as
$map_1$ = «$\pi_{Name}[R] = \pi_{Name}[S]$, $\pi_{City}[R] = \pi_{City}[T]$, $\pi_{Zip}[R] = \pi_{Zip}[T]$».
Notice that due to multiset semantics $map_1$ is equivalent to

«$\pi_{Name}[R] = \pi_{Name}[S]$, $\pi_{City,Zip}[R] = \pi_{City,Zip}[T]$».

Should $m_2$ indeed represent a normal form for $m_1$, then $map_1$ does not
characterize the mapping adequately. Copying all R.City and R.Zip tuples
into T including the duplicate ones is not what we want; we are interested
in distinct tuples only. That is, the tuple-list interpretation is too restrictive:
it is consistent with the semantics of normalization only when (R.Name)
and (R.City, R.Zip) are keys in $m_1$. To embrace the more general case, we
examine another possible interpretation for morphisms, the tuple-set
interpretation. For the morphism of Fig. 6.1, the tuple-set interpretation is $map_2$
= «$\pi_{Name}(R) = \pi_{Name}(S)$, $\pi_{City,Zip}(R) = \pi_{City,Zip}(T)$». An expression such
as $\pi_{City,Zip}(R)$, with parentheses instead of brackets, refers to the set of
tuples City, Zip selected from R. Observe that $map_2$ holds even if (R.Name),
(R.City, R.Zip) are not keys in R.
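
The difference between the bracket and the parenthesis notation can be seen on a small instance in which (City, Zip) is not a key of R. The sketch below is illustrative only: rows are Python dicts, the bracket projection is modeled as a multiset (duplicates kept, order ignored, which is an assumption), the parenthesis projection as a set, and the sample data and helper names are made up.

```python
from collections import Counter

def proj_list(relation, *attrs):
    """pi_attrs[relation]: projection that keeps duplicates."""
    return Counter(tuple(row[a] for a in attrs) for row in relation)

def proj_set(relation, *attrs):
    """pi_attrs(relation): projection onto distinct tuples."""
    return {tuple(row[a] for a in attrs) for row in relation}

# Two people share one address, so (City, Zip) is not a key of R.
R = [{"Name": "Ann", "City": "Berlin", "Zip": "002"},
     {"Name": "Bob", "City": "Berlin", "Zip": "002"}]
S = [{"Name": "Ann", "ID": 1}, {"Name": "Bob", "ID": 2}]
T = [{"ID": 1, "City": "Berlin", "Zip": "002"}]   # normalized: one address row

map1_holds = (proj_list(R, "Name") == proj_list(S, "Name")
              and proj_list(R, "City", "Zip") == proj_list(T, "City", "Zip"))
map2_holds = (proj_set(R, "Name") == proj_set(S, "Name")
              and proj_set(R, "City", "Zip") == proj_set(T, "City", "Zip"))
```

On this instance the tuple-list constraints fail because R carries the address twice, while the tuple-set constraints are satisfied.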

In the tuple-set interpretation, we assume that whenever two
or more lines connect the attributes of two tables, the relationship
between the respective attribute values remains preserved, as in
$\pi_{City,Zip}(R) = \pi_{City,Zip}(T)$. This assumption turns out to be too strong
for the morphisms used in Rondo. To illustrate, consider Fig. 6.2. The
figure depicts the composition of two simple morphisms, whose semantics
is $m_1\_m_2$ = «$\pi_{City}(R) = \pi_{City}(S)$, $\pi_{Zip}(R) = \pi_{Zip}(T)$» and $m_2\_m_3$ =
«$\pi_{City}(S) = \pi_{City}(U)$, $\pi_{Zip}(T) = \pi_{Zip}(U)$». Observe that the relationship
between cities and zip codes in $m_2$ is vacuous. That is, if we associate
the instances of $m_1$ and $m_3$ by way of mappings $m_1\_m_2$ and $m_2\_m_3$, we
cannot expect the relationship between cities and zip codes to be preserved
in the composed mapping $m_1\_m_2 \circ m_2\_m_3$. In other words, the
constraint $\pi_{City,Zip}(R) = \pi_{City,Zip}(U)$ is not guaranteed to be satisfied given
the constraints imposed by the mappings $m_1\_m_2$ and $m_2\_m_3$. For
example, consider the instances $i_1$ = «{(Seattle, 001), (Berlin, 002)}», $i_2$ =
«{Seattle, Berlin}, {001, 002}», $i_3$ = «{(Seattle, 002), (Berlin, 001)}».
Although $(i_1, i_2) \in m_1\_m_2$ and $(i_2, i_3) \in m_2\_m_3$, the constraint
$\pi_{City,Zip}(R) = \pi_{City,Zip}(U)$ does not hold for the instances $(i_1, i_3)$. In
contrast, the constraint $\pi_{City}(R) = \pi_{City}(U)$, $\pi_{Zip}(R) = \pi_{Zip}(U)$ does hold. It
corresponds to our third interpretation, the value-set interpretation $map_3$ of
Fig. 6.1.
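
The counterexample can be replayed directly. In this sketch the three instances are written as Python sets (R and U as sets of (city, zip) pairs, S and T as sets of single values); the variable and helper names are illustrative.

```python
i1_R = {("Seattle", "001"), ("Berlin", "002")}          # instance of m1
i2_S, i2_T = {"Seattle", "Berlin"}, {"001", "002"}      # instance of m2
i3_U = {("Seattle", "002"), ("Berlin", "001")}          # instance of m3

def cities(relation):
    return {city for city, _zip in relation}

def zips(relation):
    return {zip_code for _city, zip_code in relation}

# Membership in the two morphisms being composed.
in_m1_m2 = cities(i1_R) == i2_S and zips(i1_R) == i2_T
in_m2_m3 = i2_S == cities(i3_U) and i2_T == zips(i3_U)

# Candidate constraints on the composed mapping.
tuple_set_holds = i1_R == i3_U
value_set_holds = cities(i1_R) == cities(i3_U) and zips(i1_R) == zips(i3_U)
```

Both memberships hold, the tuple-set constraint between R and U fails, and the value-set constraint is satisfied, which is exactly the behaviour described above.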

The interpretations that we considered relate to each other as follows:
$map_1 \subseteq map_2 \subseteq map_3$. We argued that $map_1$ and $map_2$ do not
characterize the semantics of Rondo morphisms correctly. The weaker
interpretation $map_3$ is a compromise and we will use it as our working
assumption for morphism semantics. A number of alternative interpretations of
morphisms could be constructed by, e.g., combining the set-based and list-based
projection or using the subset operator instead of equality, as in
«$\pi_{Name}[R] \subseteq \pi_{Name}[S]$, $\pi_{City,Zip}(R) = \pi_{City,Zip}(T)$», etc. However, since the
morphisms in Rondo are plain sets of arcs, such variants cannot be distinguished
in the syntax, i.e., in the graphical representation; among other things,
the interpretation of morphisms needs to be symmetric with respect to $m_1$
and $m_2$. These variants could be used as another useful mapping language in
future work.

Fig. 6.2. Relationship between cities and zip codes is not preserved on
composition

The value-set semantics is broad enough to subsume a number of useful 
transformations. Some of them are listed below (referring to the schemas of 
Fig. 6.1). 

$S \bowtie T \to R$:

R = SELECT [DISTINCT] Name, City, Zip
    FROM S, T WHERE S.ID = T.ID

$R \to S, T$:

S = SELECT DISTINCT Name, Sk(Name) FROM R;

T = SELECT DISTINCT Sk(City, Zip), City, Zip FROM R

$R \to S \bowtie T$:

S = SELECT DISTINCT Name, Sk(Name) FROM R;

T = SELECT [DISTINCT] Sk(Name), City, Zip FROM R

$R \to T \bowtie S$:

S = SELECT [DISTINCT] Name, Sk(City, Zip) FROM R;

T = SELECT DISTINCT Sk(City, Zip), City, Zip FROM R

For example, the first SELECT clause, labeled with $S \bowtie T \to R$, represents
two transformations obtained by including or omitting the DISTINCT
subclause; they define $m_1$ as a view on $m_2$. The remaining five transformations
define S and T in terms of R. Sk() denotes a Skolem function. For instance,
in $R \to S \bowtie T$, S.ID is intended to be the primary key in S, whereas T.ID is
the foreign key. In fact, since S contains distinct tuples only and the
attribute Name is in the domain of the Skolem function used to compute S.ID,
the functional dependency S.ID $\to$ S.Name holds. The inclusion dependency
T.ID $\subseteq$ S.ID is satisfied because T is generated from the same relation R
using the same Skolem function. In $R \to T \bowtie S$, T.ID is the primary key in T
while S.ID is its foreign key, i.e., T.ID $\to$ {T.City, T.Zip} and S.ID $\subseteq$ T.ID.
In $R \to S, T$ the relations S and T are obtained as independent collections
of names and addresses. All of the above view definitions are contained in
$map_3$, i.e., for each view $v$ we have $v \subseteq map_3$.
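
The transformation that makes S.ID the primary key and T.ID the foreign key can be sketched as follows. The Skolem function is stood in for by a hypothetical `sk` helper that returns a tagged tuple of its arguments (so equal arguments give equal identifiers and different arguments give different ones); the sample rows of R are made up for illustration.

```python
def sk(*args):
    """Stand-in for a Skolem function: a deterministic, injective term."""
    return ("Sk",) + args

R = [("Ann", "Seattle", "001"),
     ("Ann", "Berlin", "002"),
     ("Bob", "Berlin", "002")]

# S = SELECT DISTINCT Name, Sk(Name) FROM R       -> columns (Name, ID)
S = {(name, sk(name)) for (name, _city, _zip) in R}
# T = SELECT Sk(Name), City, Zip FROM R           -> columns (ID, City, Zip)
T = [(sk(name), city, zip_code) for (name, city, zip_code) in R]

# Functional dependency S.ID -> S.Name: no ID occurs with two names.
fd_holds = len({sid for _name, sid in S}) == len(S)
# Inclusion dependency: every T.ID also occurs as an S.ID.
inclusion_holds = {tid for tid, _c, _z in T} <= {sid for _n, sid in S}
# Value-set constraints of map3 between R and (S, T).
map3_holds = ({n for n, _c, _z in R} == {n for n, _sid in S}
              and {c for _n, c, _z in R} == {c for _tid, c, _z in T}
              and {z for _n, _c, z in R} == {z for _tid, _c, z in T})
```

All three checks succeed on the sample data, illustrating that this view definition satisfies the value-set constraints.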

In the subsequent sections we assume the value-set semantics for mor- 
phisms: each arc in a morphism establishes equality of the value sets of the 
connected schema elements and is independent of other arcs. Thus, mor- 
phisms do not describe structural or value transformations of schemas. They 
contain no joins and no WHERE clause. Although the subset of morphisms 
that we discuss is a fairly weak mapping language, it is instrumental for de- 
scribing the relationships between schemas when the exact transformation is 
not known. For example, in Clio (Popa et al. 2002), the set of initial corre- 
spondences between the elements of two schemas is represented graphically 
similarly to the morphism of Fig. 6.1. These correspondences are subsequently 
refined into precise view definitions using an elaborate user interface. Hence, 
a morphism could be seen as a “rough” mapping that covers a variety of 
possible view definitions. 

We define the semantics of the empty morphism between $m_1$ and $m_2$ as
$m_1 \times m_2$. That is, the empty morphism does not place any constraints on the
mapping between $m_1$ and $m_2$. An empty relational schema contains exactly
one database instance, the empty set. Finally, we assume that relational
schemas may contain NULL values, which are treated as absent values. In
particular, $\pi_A(R)$ is guaranteed to contain only non-NULL values and is the
empty set when all R.A values are NULLs. We explain why this assumption
is important when we discuss composition in Sect. 6.3.



6.2 Semantics of Selectors 

A selector identifies a set of model elements. We define the state-based
semantics of selector $s$ as that of the corresponding identity morphism Id($s$). If
$s$ contains a set of relation attributes, then Id($s$) is a one-to-one correspondence
between the attributes $s$ of two identical relational schemas. The state-based
semantics of such a correspondence has been defined in the previous section.

To illustrate, consider the model $m_2$ of Fig. 6.1. Let $s$ = {S.Name, T.City,
T.Zip} and let $m'_2$ be a model identical to $m_2$. The state-based semantics of
the selector $s$ is that of the mapping

$s_{map}$ = « $\pi_{Name}(m_2.S) = \pi_{Name}(m'_2.S)$,
             $\pi_{City}(m_2.T) = \pi_{City}(m'_2.T)$,
             $\pi_{Zip}(m_2.T) = \pi_{Zip}(m'_2.T)$ :: $m_2$ :: $m'_2$ ».

Notice that Invert($s_{map}$) = $s_{map}$.


6.3 Structural vs. State-Based Operators

In this section, we compare the state-based semantics of the structural ope- 
rators used in Rondo with that of the operators of Chap. 4. As we will see, 
the state-based semantics of some structural operators, such as Compose 
and Invert, is identical to that of the respective state-based operators. Other 
structural operators, such as Extract and Merge, return materializations of 
the exact results (compare Sect. 4.3), i.e., have a weaker state-based seman- 
tics. This fact is not surprising given that the definitions of the state-based 
operators Extract and Merge contain minimality requirements that are very 
hard to meet in concrete schema and mapping languages. We also show that 
one of the operators, Diff, produces results that violate the desired conditions 
that we postulated in Chap. 4; specifically, under the value-set interpretation 
for morphisms, Diff loses information. 

We summarized the signatures of the structural and state-based operators 
in Table 2.1 on page 23 and Table 4.1 on page 64, respectively. The structural 
operators are defined for models represented as directed labeled graphs and 
for a simple concrete mapping language, the morphisms. Selectors can be 
viewed as syntactic sugar; they can be replaced in all operator signatures by 
(identity) morphisms. In contrast, the state-based operator definitions apply 
to mappings expressed in arbitrary languages and do not rely on a particular 
representation of models and mappings. Although the signatures of the state- 
based operators such as Id, Domain, or Invert, are very similar to those of the 
respective structural operators, a key difference to keep in mind is that the 
mappings taken as parameters identify binary relations on instances rather 
than binary relations on model elements. 

We assume that in accordance with the operator signatures, the free va- 
riables used below range over simple relational schemas, value-set morphisms, 
and selectors (i.e., identity morphisms) rather than over arbitrary models and 
mappings, so that we do not have to quantify the variables in each expression. 
We start with the structural operator Compose and the state-based operator
Compose, denoted as $\circ$. The state-based semantics of the structural Compose
is exactly the one specified in Definition 4.2.1:

$\mathrm{Compose}(map_1, map_2) = map_1 \circ map_2$;

Fig. 6.3 shows three representative examples of composition, in the rows. The
leftmost column contains the source schemas and mappings. The second column
depicts the result of composition produced by the structural operator
Compose. The third column presents the result that satisfies the conditions
of the state-based operator Compose under the assumption that NULL values
are allowed in relations (our working assumption). In the rightmost column,
NULLs are disallowed. Observe that the morphisms shown in column 2 have
exactly the semantics of the mappings in column 3. For instance, in the second
example structural composition yields an empty morphism. And indeed,
for any two given instances $i_1 \in m_1$ and $i_3 \in m_3$ it is always possible to find
an instance $i_2 \in m_2$ such that $\pi_A(i_1.R) = \pi_A(i_2.S)$ and $\pi_B(i_3.T) = \pi_B(i_2.S)$
by constructing a relation S whose A values are drawn from $\pi_A(i_1.R)$ and B
values are drawn from $\pi_B(i_3.T)$. Should either $\pi_A(i_1.R) = \emptyset$ or $\pi_B(i_3.T) = \emptyset$
hold, the respective attribute values can be simply filled with NULLs. However,
if NULL values are disallowed, which is the case in the classical relational
model, then such an instance $i_2$ can only be found if either both R and T
are empty, or both are non-empty. Hence, in this case composition yields the
mapping « $R = \emptyset \Leftrightarrow T = \emptyset$ » shown in column 4. The example illustrates that
it is critical to specify the state-based semantics of schemas precisely, down
to such details as whether NULLs are supported or not. The third example
of Fig. 6.3 is analogous.
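The construction of the intermediate instance in the second example can be sketched in Python. This is a minimal sketch under assumptions: NULL is modelled as `None`, R and T are single-column relations given as sets of 1-tuples, and the helper names `project` and `intermediate_s` are hypothetical.

```python
def project(rel, idx):
    """Value set of one column; NULLs (None) are treated as absent values."""
    return {row[idx] for row in rel if row[idx] is not None}

def intermediate_s(r, t, allow_nulls=True):
    """Try to build S(A, B) with pi_A(S) = pi_A(R) and pi_B(S) = pi_B(T).

    Returns None when no such instance of S exists.
    """
    a_vals = project(r, 0)
    b_vals = project(t, 0)
    if not a_vals and not b_vals:
        return set()                      # both empty: the empty S works
    if not a_vals or not b_vals:
        if not allow_nulls:
            return None                   # exactly one side empty, no NULL padding
        a_vals = a_vals or {None}         # pad the missing column with NULL
        b_vals = b_vals or {None}
    return {(a, b) for a in a_vals for b in b_vals}

r = set()                                 # R is empty
t = {("x",), ("y",)}                      # T is not
s_with_nulls = intermediate_s(r, t)
s_without_nulls = intermediate_s(r, t, allow_nulls=False)
```

With NULLs allowed, S can be padded so that both value-set equalities hold; without NULLs the function finds no S for an empty R and a non-empty T, which is the situation captured by the constraint in column 4.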



[Figure 6.3 is not reproduced. It has three example rows and four columns:
the input morphisms $m_1\_m_2$ and $m_2\_m_3$; the result of structural composition
in Rondo; the result of state-based composition with NULLs; and the result
of state-based composition without NULLs.]

Fig. 6.3. Structural composition vs. state-based composition (the latter with
and without NULLs; predicate $\Leftrightarrow$ denotes if-and-only-if)



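For the operator Invert we obtain (structural operator on the left, state-based
operator on the right):

$\mathrm{Invert}(map) = \mathrm{Invert}(map)$;

The structural Invert simply swaps the left and right sides of the morphism; this
is exactly what the state-based Invert does to arbitrary mappings.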

The relationship between the structural Extract and the state-based Extract is
more subtle. We can show that the result of the structural Extract is a valid
materialization of the result of the state-based Extract, as we discussed in
Sect. 4.3. Formally:

$(m_c, m\_m_c) = \mathrm{Extract}(m, s) \;\to$                      (structural Extract)
   $\exists\, m_x, m\_m_x, m_c\_m_x :$
   $(\; (m_x, m\_m_x) = \mathrm{Extract}(m, s);$                    (state-based Extract)
      $m\_m_x = m\_m_c \circ m_c\_m_x;$
      $m_c = \mathrm{Range}(m\_m_c) = \mathrm{Domain}(m_c\_m_x);$
      $\mathrm{Invert}(m_c\_m_x) \circ m_c\_m_x = \mathrm{Id}(m_x); \;)$

The predicate $\to$ above denotes the logical implication. For clarity, we quantify
the free variables $m_x$, $m\_m_x$, $m_c\_m_x$ occurring in the implied part of the
statement explicitly. The above formula says that we can get the (exact)
output of the state-based Extract by defining a view on the (materialized) result
produced by the structural Extract. This view is $m_c\_m_x$. The condition
$\mathrm{Invert}(m_c\_m_x) \circ m_c\_m_x = \mathrm{Id}(m_x)$ requires $m_c\_m_x$ to be a surjective function
onto $m_x$ (see Proposition 4.2.4).

To illustrate the relationship between $m_x$ and $m_c$, consider Fig. 6.4. Let
$s$ be the identity morphism on the attributes Name, City, and Zip in $m$. The
schemas extracted from $m$ by the structural operator and by Definition 4.2.3
are depicted in Fig. 6.4 as $m_c$ and $m_x$, respectively. Schema $m_x$ contains three
tables with a single key attribute each. The mapping $m_c\_m_x$ is depicted using
arrows to distinguish it from morphisms; both $m_c\_m_x$ and $m\_m_x$ can be
defined as views as follows ($m\_m_c$ is shown for convenience):

$m_c\_m_x$ = « $m_x.U_1$ = SELECT DISTINCT Name FROM $m_c.S$,
              $m_x.U_2$ = SELECT DISTINCT City FROM $m_c.T$,
              $m_x.U_3$ = SELECT DISTINCT Zip FROM $m_c.T$ »

$m\_m_x$ = « $m_x.U_1$ = SELECT DISTINCT Name FROM $m.S$,
            $m_x.U_2$ = SELECT DISTINCT City FROM $m.T$,
            $m_x.U_3$ = SELECT DISTINCT Zip FROM $m.T$ »

$m\_m_c$ = « $\pi_{Name}(m_c.S) = \pi_{Name}(m.S)$,
            $\pi_{City}(m_c.T) = \pi_{City}(m.T)$,
            $\pi_{Zip}(m_c.T) = \pi_{Zip}(m.T)$ »

It is straightforward to verify that $m\_m_x = m\_m_c \circ m_c\_m_x$.
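The relationship between the three mappings can be tried out on a concrete instance. The Python sketch below is illustrative only: the instances are invented, tables are lists of dictionaries, and the helper names are hypothetical. It picks an instance of $m_c$ that is related to an instance of $m$ by the value-set constraints and then evaluates the three views over both.

```python
def distinct(rows, attr):
    """SELECT DISTINCT attr FROM rows; NULLs (None) are treated as absent."""
    return {row[attr] for row in rows if row.get(attr) is not None}

def view_m_x(instance):
    """The views U1, U2, U3, evaluated over an instance with tables S and T."""
    return {
        "U1": distinct(instance["S"], "Name"),
        "U2": distinct(instance["T"], "City"),
        "U3": distinct(instance["T"], "Zip"),
    }

def related_by_value_sets(i_m, i_c):
    """The value-set constraints: three independent equalities of value sets."""
    return (
        distinct(i_c["S"], "Name") == distinct(i_m["S"], "Name")
        and distinct(i_c["T"], "City") == distinct(i_m["T"], "City")
        and distinct(i_c["T"], "Zip") == distinct(i_m["T"], "Zip")
    )

# Invented instance of m.
i_m = {
    "S": [{"Name": "Ann"}, {"Name": "Bob"}],
    "T": [{"City": "Berlin", "Zip": "10115"}, {"City": "Paris", "Zip": "75001"}],
}
# An instance of m_c with the same value sets but a different City/Zip pairing.
i_c = {
    "S": [{"Name": "Bob"}, {"Name": "Ann"}],
    "T": [{"City": "Berlin", "Zip": "75001"}, {"City": "Paris", "Zip": "10115"}],
}

i_x_direct = view_m_x(i_m)   # instance of m_x derived from m
i_x_via_c = view_m_x(i_c)    # instance of m_x derived from m_c
```

Both routes give the same instance of $m_x$ even though the pairing of cities and zip codes differs, which suggests why the exact result keeps only three single-attribute tables.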




Fig. 6.4. Structural extraction 
yields materialization of the state- 
based operator 



This example illustrates that extraction as implemented in Rondo does not
produce a minimal schema, but that its result nevertheless contains all
information necessary to derive the result required by Definition 4.2.3. The
example also shows a tradeoff in using weak mappings such as morphisms: on
extraction they yield schemas that are not very expressive ($m_x$).

Using similar considerations we can verify that the following relationship
holds for merging:

$(m_c, m_c\_m_1, m_c\_m_2) = \mathrm{Merge}(m_1, m_2, m_1\_m_2) \;\to$     (structural Merge)
   $\exists\, m, m\_m_1, m\_m_2, m_c\_m :$
   $(\; (m, m\_m_1, m\_m_2) = \mathrm{Merge}(m_1, m_2, m_1\_m_2);$       (state-based Merge)
      $m_c\_m_1 = m_c\_m \circ m\_m_1;$
      $m_c\_m_2 = m_c\_m \circ m\_m_2;$
      $m_c = \mathrm{Domain}(m_c\_m);$
      $\mathrm{Invert}(m_c\_m) \circ m_c\_m = \mathrm{Id}(m); \;)$

That is, the schema and morphisms produced by the structural Merge are
materializations of the exact results of the state-based Merge.

Unfortunately, we cannot characterize the operator Delete as easily in
terms of Diff. Delete is defined as a derived operator,
$\mathrm{Delete}(m, s) := \mathrm{Extract}(m, \mathrm{All}(m) - s)$, and does not guarantee that we can
reconstruct each instance of $m$ given an instance of the schema obtained by
extraction and an instance obtained by deletion. For a counterexample, see
Example 4.2.23. That is, as we explain in Sect. 10.5, the operator Delete is not
suitable for computing the view complement in data warehousing scenarios.

Interestingly, if we assume a different semantics for morphisms, the tuple- 
list semantics discussed in Sect. 6.1, we find that the results of Delete are 
in fact materializations of the results of Diff. This fact reiterates the impor- 
tance of formal specification of semantics: the operators in Rondo may have 
been developed with different morphism semantics in mind. Under value-set 
semantics of morphisms, Delete can be viewed as a variant of the Extract 
operator. 

Other structural operators used in Rondo can be characterized as follows: 

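$\mathrm{Range}(map) = \mathrm{Invert}(map) \circ map$;

$\mathrm{Domain}(map) = map \circ \mathrm{Invert}(map)$;

$\mathrm{RestrictDomain}(map, s) = s \circ map$;

$\mathrm{RestrictRange}(map, s) = map \circ s$;

$\mathrm{Union}(s_1, s_2) = s_1 \oplus s_2$;

$\mathrm{All}(m) \supseteq \mathrm{Id}(m)$;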
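To get a feel for these equations, one can replay them at the level of model elements. The Python sketch below is an element-level analogue, not the state-based semantics itself: a morphism is a set of element pairs, a selector is viewed as an identity morphism, and the sample morphism is invented and deliberately one-to-one (for a morphism that is not one-to-one, the compositions for Domain and Range would relate distinct elements as well).

```python
def compose(m1, m2):
    """Relational composition: (x, z) whenever (x, y) in m1 and (y, z) in m2."""
    return {(x, z) for (x, y1) in m1 for (y2, z) in m2 if y1 == y2}

def invert(m):
    """Swap the left and right sides of every arc."""
    return {(y, x) for (x, y) in m}

def identity(selector):
    """Id(s): a selector viewed as an identity morphism."""
    return {(e, e) for e in selector}

def restrict_domain(m, selector):
    return compose(identity(selector), m)

def restrict_range(m, selector):
    return compose(m, identity(selector))

# Invented one-to-one morphism between two schemas.
m = {("S.Name", "P.Name"), ("T.City", "Q.City"), ("T.Zip", "Q.Zip")}

domain_as_id = compose(m, invert(m))      # Domain(map) = map o Invert(map)
range_as_id = compose(invert(m), m)       # Range(map) = Invert(map) o map
only_city = restrict_domain(m, {"T.City"})
only_zip = restrict_range(m, {"Q.Zip"})
union_ids = identity({"S.Name"}) | identity({"T.City"})   # Union of two selectors
```

For this morphism the two compositions are the identity morphisms on the left and right elements of `m`, and the restrictions keep exactly the arcs that touch the selected elements.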



6.4 Revisiting Change Propagation 

The script that implements change propagation, which we presented in 
Chap. 2, differs from the script developed in Chap. 5. The principal difference 
lies in the way that deletion is propagated. In this section we demonstrate 
that the state-based solution of Chap. 5 can be used to obtain a different 
solution for propagating deletion in Rondo, which turns out to be equivalent 
to the original structural script of Chap. 2. Note that in general, a structural 
script is not equivalent to a state-based script created simply by replacing 
the structural operators by the corresponding state-based operators. 

In Chap. 2, propagation of deletion is implemented using the script

operator PropagateDeletionsA($s_1$, $d_1$, $s_1\_d_1$, $s_1\_s_2$)
   $(d'_1, d_1\_d'_1) = \mathrm{Delete}(d_1, \mathrm{Traverse}(\mathrm{All}(s_1) - \mathrm{Domain}(s_1\_s_2),\; s_1\_d_1))$;
   return $(d'_1, d_1\_d'_1)$;

In Chap. 5 we argued that the following state-based specification of
propagating deletions is correct:

operator PropagateDeletionsB($s_1$, $d_1$, $s_1\_d_1$, $s_1\_s_2$)

1. $d_1\_s_2 = \mathrm{Invert}(s_1\_d_1) \circ s_1\_s_2$;

2. $(d_{1d}, d_1\_d_{1d}) = \mathrm{Diff}(d_1, \mathrm{Invert}(s_1\_d_1))$; // added in $d_1$

3. $(d_{1x}, d_1\_d_{1x}) = \mathrm{Extract}(d_1, d_1\_s_2)$; // kept in $d_1$

4. $(d'_1, d'_1\_d_{1x}, d'_1\_d_{1d}) = \mathrm{Merge}(d_{1x}, d_{1d}, \mathrm{Invert}(d_1\_d_{1x}) \circ d_1\_d_{1d})$;

5. $d_1\_d'_1 = d_1\_d_{1d} \circ \mathrm{Invert}(d'_1\_d_{1d}) \oplus d_1\_d_{1x} \circ \mathrm{Invert}(d'_1\_d_{1x})$;

6. return $(d'_1, d_1\_d'_1)$;

Now we show that the implementation of PropagateDeletionsB using a
one-to-one operator translation into structural operators is equivalent
to PropagateDeletionsA. Whenever a morphism is passed as a parameter
to a structural operator that expects a selector, we simply take the
Domain of the morphism. For example, $\mathrm{Extract}(d_1, d_1\_s_2)$ of line 3 becomes
$\mathrm{Extract}(d_1, \mathrm{Domain}(d_1\_s_2))$. If we replace Diff by Delete and expand the
definition of Delete as $\mathrm{Delete}(m, s) = \mathrm{Extract}(m, \mathrm{All}(m) - s)$ according to the
definition of Sect. 2.3.3, then lines 2-3 translate to:

$(d_{1d}, d_1\_d_{1d}) = \mathrm{Extract}(d_1, \mathrm{All}(d_1) - \mathrm{Domain}(\mathrm{Invert}(s_1\_d_1)))$;

$(d_{1x}, d_1\_d_{1x}) = \mathrm{Extract}(d_1, \mathrm{Domain}(d_1\_s_2))$;

The expression $\mathrm{Domain}(\mathrm{Invert}(s_1\_d_1))$ can be simplified as $\mathrm{Range}(s_1\_d_1)$ (see
Sect. 2.3.2). The algorithm used in Chap. 2 to implement the Merge operator
computes the union of models with renaming. In line 4, no conflicts arise
from using the operator Merge, since the schemas to be merged have been
extracted from the same model. Therefore, the above two extractions followed
by a Merge are equivalent to a single extraction over the union of mappings.
Thus, lines 1-5 translate to:

$(d'_1, d_1\_d'_1) = \mathrm{Extract}(d_1, (\mathrm{All}(d_1) - \mathrm{Range}(s_1\_d_1)) + \mathrm{Range}(\mathrm{Invert}(s_1\_s_2) * s_1\_d_1))$;

It remains to show that the above expression is equivalent to:

$(d'_1, d_1\_d'_1) = \mathrm{Delete}(d_1, \mathrm{Traverse}(\mathrm{All}(s_1) - \mathrm{Domain}(s_1\_s_2),\; s_1\_d_1))$;

which can be expanded into

$(d'_1, d_1\_d'_1) = \mathrm{Extract}(d_1, \mathrm{All}(d_1) - \mathrm{Traverse}(\mathrm{All}(s_1) - \mathrm{Domain}(s_1\_s_2),\; s_1\_d_1))$;

It is sufficient to show the equality:

$(\mathrm{All}(d_1) - \mathrm{Range}(s_1\_d_1)) + \mathrm{Range}(\mathrm{Invert}(s_1\_s_2) * s_1\_d_1) =$
$\mathrm{All}(d_1) - \mathrm{Traverse}(\mathrm{All}(s_1) - \mathrm{Domain}(s_1\_s_2),\; s_1\_d_1)$

We prove the equality using the schematic representation of Fig. 6.5, which
uses the following definitions:

$A_1 = \mathrm{Domain}(s_1\_s_2) \cap \mathrm{Domain}(s_1\_d_1)$
$A_2 = \mathrm{Domain}(s_1\_d_1) - \mathrm{Domain}(s_1\_s_2)$
$A_3 = \mathrm{Domain}(s_1\_s_2) - \mathrm{Domain}(s_1\_d_1)$
$A_4 = \mathrm{All}(s_1) - A_1 - A_2 - A_3$
$B_1 = \mathrm{Traverse}(A_1, s_1\_d_1)$
$B_2 = \mathrm{Traverse}(A_2, s_1\_d_1)$
$B_3 = \mathrm{All}(d_1) - B_1 - B_2$

Fig. 6.5. Schematic representation for structural change propagation script

The selectors $A_1$, $A_2$, $A_3$, $A_4$ and $B_1$, $B_2$, $B_3$ yield disjoint decompositions
of $\mathrm{All}(s_1)$ and $\mathrm{All}(d_1)$, respectively. The following properties hold:

$\mathrm{All}(s_1) = A_1 + A_2 + A_3 + A_4$
$\mathrm{All}(d_1) = B_1 + B_2 + B_3$
$\mathrm{Domain}(s_1\_d_1) = A_1 + A_2$
$\mathrm{Range}(s_1\_d_1) = B_1 + B_2$
$\mathrm{Domain}(s_1\_s_2) = A_1 + A_3$

Thus, we obtain

$(\mathrm{All}(d_1) - \mathrm{Range}(s_1\_d_1)) + \mathrm{Range}(\mathrm{Invert}(s_1\_s_2) * s_1\_d_1) =$
$(B_1 + B_2 + B_3 - (B_1 + B_2)) + \mathrm{Traverse}(A_1, s_1\_d_1) =$
$B_1 + B_3$

and

$\mathrm{All}(d_1) - \mathrm{Traverse}(\mathrm{All}(s_1) - \mathrm{Domain}(s_1\_s_2),\; s_1\_d_1) =$
$B_1 + B_2 + B_3 - \mathrm{Traverse}(A_2 + A_4, s_1\_d_1) =$
$B_1 + B_2 + B_3 - (\mathrm{Traverse}(A_2, s_1\_d_1) + \mathrm{Traverse}(A_4, s_1\_d_1)) =$
$B_1 + B_2 + B_3 - (B_2 + \emptyset) =$
$B_1 + B_3$

Hence, both realizations of propagating deletions are equivalent. 
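The set identity behind this argument is easy to check mechanically on a small case. The following Python sketch is a sanity check under assumptions, not a proof: selectors are sets of element names, morphisms are sets of pairs, the structural composition `*` is modelled as relational composition, and the sample models contain one invented element for each region of the decomposition.

```python
def domain(m):
    return {x for (x, _) in m}

def range_(m):
    return {y for (_, y) in m}

def invert(m):
    return {(y, x) for (x, y) in m}

def compose(m1, m2):
    """Structural composition (*), modelled as relational composition."""
    return {(x, z) for (x, y1) in m1 for (y2, z) in m2 if y1 == y2}

def traverse(selector, m):
    """Elements reachable from the selector along the arcs of morphism m."""
    return {y for (x, y) in m if x in selector}

# Invented models: a1..a4 fall into regions A1..A4, b1..b3 into B1..B3.
all_s1 = {"a1", "a2", "a3", "a4"}
all_d1 = {"b1", "b2", "b3"}
s1_d1 = {("a1", "b1"), ("a2", "b2")}
s1_s2 = {("a1", "c1"), ("a3", "c3")}

# Selector derived from the translated PropagateDeletionsB.
lhs = (all_d1 - range_(s1_d1)) | range_(compose(invert(s1_s2), s1_d1))
# Selector used by PropagateDeletionsA (after expanding Delete).
rhs = all_d1 - traverse(all_s1 - domain(s1_s2), s1_d1)
```

Both selectors evaluate to the elements in regions $B_1$ and $B_3$ for this sample.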

Although strictly speaking the structural scripts do not implement the 
state-based scripts, the above example backs the intuition for the following 
conclusions that we expect to hold in general. 

On the one hand, the example illustrates that it may be possible to find 
alternative realizations, or “physical plans”, for a given state-based script. 
Here, a different equivalent realization allows us to use a single complex ope- 
rator Extract instead of three invocations of complex operators Extract, Diff, 
and Merge. Propagating deletions using selectors does not require computing 
the results of Diff and Merge. The rest of the selector-oriented realization
PropagateDeletionsA uses the primitive operators, which can be optimized 
by the SQL engine of the underlying DBMS. In other scripts, further poten- 
tial optimization can be obtained by storing precomputed results that are 
used at several places in a script. 

On the other hand, the example shows that it may be possible to optimize 
the execution of a given state-based script even if the script itself cannot be 
further simplified into an equivalent state-based script. This is similar to 
database query processing: we first optimize the logical query plan, then the 
physical query plan. We expect the above observations to hold for proper 
implementations of state-based scripts as well. 



6.5 Conclusions 

In Part II, we explored a state-based semantics for the set of operators put 
forth in model management. We derived the properties of the operators from 
established metadata management scenarios and verified their applicability 
using numerous examples. We obtained a simplified characterization of the 
operators and presented an initial study of their properties. 

The state-based semantics proved instrumental for clarifying the meaning 
of the conceptual structures used in the prototype Rondo. Moreover, the ana- 
lysis that we presented helped us realize a deficiency in the implementation 
of the operator Diff. 

A noticeable feature of the operators Extract, Merge, and Diff is that their
definitions impose minimality conditions on the output models. These
conditions are critical for expressing the intended semantics of the operators.
For example, if we remove the minimality condition (iii) from the definition
of Extract, then the operator becomes vacuous: for any input model m we
could easily find a trivial output model, the model m itself. Analogously,
removing the minimality condition (iv) from Merge would allow the operator
to produce a virtually arbitrary, very expressive model as output. Finally,
eliminating condition (ii) from the definition of Diff would make Diff a derived
operator that could be specified in terms of the operators Extract, Merge,
Invert, and Compose. The minimality conditions help us make the operators
non-redundant, but necessarily contribute to the complexity of computing
their results. The completeness and non-redundancy of the suggested set of
operators is an open problem, which we discuss in more detail in Sect. 11.3.3.

A major strength of the state-based characterization is its ability to spe- 
cify model-management operations in an abstract fashion, without appealing 
to any idiosyncratic schema, constraint, or transformation languages. Howe- 
ver, applying such an abstract characterization to concrete languages and 
developing practical algorithms for computing the results of operators effec- 
tively can be extremely hard. For example, as we demonstrate in Sect. 10.1.2, 
the problem of answering queries using views can be defined quite easily using 
a state-based characterization, whereas developing rewriting algorithms for
specific query languages proved to be very challenging. We expect that com- 
puting the results of model-management scripts for concrete languages, the 
problem which we called materialization, will prove at least as hard. Exten- 
sive future research is required for solving this problem. 

By following a state-based approach we recognize that a model is more 
than just syntax: it is a template for instances. On the other hand, a model is 
more than just a template for instances: its syntax is important for developers 
and applications. For example, although many relational schemas might be 
able to express identical information, applications rely on the fact that they 
contain tables with certain names and certain attributes, whose order in 
the table definition may be important. Therefore, considering both state- 
based and structural semantics is critical for specifying the effects of model- 
management scripts. 

We believe that the state-based semantics should apply not only to Rondo, 
but also to other systems that will be built in the future. In particular, it 
provides guidelines for implementing the operators for much more powerful 
schema and mapping languages as compared to those that we utilized in our 
programming platform. We expand the discussion of future work on state- 
based semantics in Chap. 11. 



Part III 



Schema Matching 




7. Similarity Flooding Algorithm 



“Mr. Martin: . . . You know, in my bedroom there is a bed, and it 
is covered with a green eiderdown. This room, with the bed and the 
green eiderdown, is at the end of the corridor between the w.c. and 
the bookcase, dear lady! 

Mrs. Martin: What a coincidence, good Lord, what a coincidence! 
My bedroom, too, has a bed with a green eiderdown and is at the 
end of the corridor, between the w.c., dear sir, and the bookcase! 

Mr. Martin: How bizarre, curious, strange! Then, madam, we live 
in the same room and we sleep in the same bed, dear lady.” 

- Eugene Ionesco (1958), “The Bald Soprano” 



Finding correspondences between models is required in many application 
scenarios. This task is often referred to as matching. In generic model mana- 
gement, matching is embodied in the operator Match, which plays a critical 
role in many model-management scripts. The operator Match takes two mo- 
dels as input and returns a mapping between the models as output. Of all 
operators that we examined in the previous chapters, Match is the only one 
that lacks a formal definition and, in a way, enjoys a special status. The
reason for this special status is that matching typically involves information that
is not contained in the input models. Uncovering how two models relate to 
each other requires reading documentation, examining instances of models, 
and talking to the engineers who designed or deploy the models. 

Matching is a complex and time-consuming design task. Techniques used
to automate this task often differ substantially. For example, for matching
relational schemas one could use SQL data types to determine which columns
are possibly related. On the other hand, in XML schema matching, hierarchical
relationships between schema elements can be exploited. Because of
this diversity, applications that rely on matching are often built from scratch
and require a significant amount of thought and programming. We address this
problem by proposing a matching algorithm that allows quick development
of matchers for a broad spectrum of different scenarios. We are not trying
to outperform custom matchers that use highly tuned, domain-specific
heuristics.

In this chapter we suggest a simple structural algorithm that can be used 
for matching of diverse data structures including schemas, instances, and 
other kinds of models. The algorithm that we suggest is based on the fol- 
lowing idea. The models to be matched, which are represented as directed 
labeled graphs, are used in an iterative fixpoint computation whose results 
tell us what nodes in one graph are similar to nodes in the second graph. For 
computing the similarities, we rely on the intuition that elements of two di- 
stinct models are similar when they occur in similar contexts, i.e. , when their 
adjacent elements are similar. In other words, a part of the similarity of two 
elements propagates to their respective neighbors. The spreading of similarities
in the matched models is reminiscent of the way IP packets flood the
network in broadcast communication. For this reason, we call our algorithm
the Similarity Flooding algorithm. The result produced by the algorithm is 
a morphism, a simple kind of mapping that we presented in Chap. 2. Depen- 
ding on the particular matching goal, we then choose a subset of the resulting 
mapping using adequate filters. 

After our algorithm runs, we expect a human to check and if necessary 
adjust the results. As a matter of fact, we evaluate the “accuracy” of the 
algorithm by counting the number of needed adjustments. We designed a 
graphical tool which helps human developers to inspect and post-process 
the suggestions delivered by the algorithm. In this tool, the user adjusts the 
proposed match result by removing or adding lines connecting the elements 
of two schemas. As we stressed above, the correct match often depends on
information that is available or understandable only to humans. For example,
even matches as plausible as ZipCode to zip_code can be deemed incorrect by
a data warehouse designer who knows that zip codes from a given relational
source should not be collected due to poor data quality. In such cases, the
suggested mappings may be incorrect or incomplete.

This chapter is structured as follows: 

— In Sect. 7.1, we give an overview of the approach. The Similarity Flooding 
algorithm is introduced in Sect. 7.2. 

— In Sect. 7.3, we present a generalized formula for the algorithm and discuss 
its convergence and complexity in Sect. 7.4. 

— In Sect. 7.5, we demonstrate the applicability of the algorithm for diverse 
matching tasks. 

In subsequent chapters, we address the filtering of the results delivered 
by the SF algorithm (Chap. 8) and its evaluation and tuning (Chap. 9). We 
conclude Part III in Sect. 9.6. 


7.1 Overview of the Approach

Before we go into details of our matching algorithm, let us briefly walk
through an example that illustrates matching of two relational database
schemas. Please keep in mind that the technique we describe is not limited to
relational schemas. Consider schemas $S_1$ and $S_2$ depicted in Fig. 7.1. The
elements of $S_1$ and $S_2$ are tables and columns. Assume for now that our goal
is to obtain exactly one matching element for every element in $S_1$. A part
of the matching result could be, for example, the correspondence of column
Personnel/Pname to column Employee/EmpName. A sequence of steps that
allows us to determine the correspondences between tables and columns in
$S_1$ and $S_2$ can be expressed as the following script:



($S_1$)

CREATE TABLE Personnel (
    Pno int,
    Pname string,
    Dept string,
    Born date,
    UNIQUE perskey(Pno)
)

($S_2$)

CREATE TABLE Employee (
    EmpNo int PRIMARY KEY,
    EmpName varchar(50),
    DeptNo int REFERENCES Department,
    Salary dec(15,2),
    Birthdate date
)

CREATE TABLE Department (
    DeptNo int PRIMARY KEY,
    DeptName varchar(70)
)

Fig. 7.1. Matching two relational schemas: Personnel and Employee-Department



1. $G_1$ = ReadSQLDDL($S_1$); $G_2$ = ReadSQLDDL($S_2$);

2. initialMap = StringMatch($G_1$, $G_2$);

3. product = SFJoin($G_1$, $G_2$, initialMap);

4. result = SelectThreshold(product);

As a first step, we translate the schemas from their native format into
graphs $G_1$ and $G_2$. In our example, the schemas are stored natively as
ASCII files containing table definitions in SQL DDL. A portion of the graph
$G_1$ is depicted in Fig. 7.2. The translation into graphs is done using an import
filter ReadSQLDDL that understands the definitions of relational schemas.
We do not insist on choosing a particular graph representation for relational
schemas. The representation used in Fig. 7.2 is based on the Open Information
Model specification (Bernstein et al. 1999). The nodes in the graph are shown
as ovals and rectangles. The labels inside the ovals denote the identifiers of
the nodes, whereas rectangles represent literals, or string values. For example,
node &1 represents the table Personnel in graph $G_1$, whereas nodes &2, &4,
and &6 denote columns Pno, Pname, and Dept, respectively. Column Born and
unique key perskey are omitted from the figure for clarity. Tables Employee
and Department from schema $S_2$ are represented in a similar manner in graph
$G_2$. In our example, $G_1$ has a total of 31 nodes while $G_2$ has 55 nodes.



Fig. 7.2. A portion of graph representation $G_1$ for relational schema $S_1$
(figure not reproduced)



As a second step, we obtain an initial mapping initialMap between $G_1$
and $G_2$ using operator StringMatch. The mapping initialMap is obtained
using a simple string matcher that compares common prefixes and suffixes
of literals. A portion of the initial mapping is shown in Table 7.1. Literal
nodes are highlighted using apostrophes. The second column of the table
lists similarity values between nodes in $G_1$ and $G_2$ computed on the basis
of their textual content. The similarity values range between 0 and 1 and
indicate how well the corresponding nodes in $G_1$ match their counterparts in
$G_2$. Notice that the initial mapping is still quite imprecise. For instance, it
suggests mapping column names onto table names (e.g., column Dept in $S_1$
onto table Department in $S_2$, line 9), or names of data types onto column
names (e.g., SQL type date in $S_1$ onto column Birthdate in $S_2$, line 8).



Table 7.1. A portion of initialMap obtained by string matching (10 of total 26
entries are shown)

Line#   Similarity   Node in $G_1$    Node in $G_2$
1.      1.0          Column           Column
2.      0.66         ColumnType       Column
3.      0.66         “Dept”           “DeptNo”
4.      0.66         “Dept”           “DeptName”
5.      0.5          UniqueKey        PrimaryKey
6.      0.26         “Pname”          “DeptName”
7.      0.26         “Pname”          “EmpName”
8.      0.22         “date”           “Birthdate”
9.      0.11         “Dept”           “Department”
10.     0.06         “int”            “Department”


As a third step, operator SFJoin is applied to produce a refined mapping
called product between $G_1$ and $G_2$. In this chapter we propose an iterative
“similarity flooding” (SF) algorithm based on a fixpoint computation that is
used for implementing operator SFJoin. The SF algorithm has no knowledge
of node and edge semantics. As a starting point for the fixpoint computation
the algorithm uses an initial mapping like initialMap. Our algorithm
is based on the assumption that whenever any two elements in models $G_1$
and $G_2$ are found to be similar, the similarity of their adjacent elements
increases. Thus, over a number of iterations, the initial similarity of any two
nodes propagates through the graphs. For example, in the first iteration the
initial textual similarity of strings “Pname” and “EmpName” adds to the
similarity of columns Personnel/Pname and Employee/EmpName. In the next
iteration, the similarity of Personnel/Pname to Employee/EmpName propagates
to the SQL types string and varchar(50). This subsequently causes an
increase in similarity between literals “string” and “varchar”, leading to a
higher resemblance of Personnel/Dept to Department/DeptName than that
of Personnel/Dept to Department/DeptNo. The algorithm terminates after
a fixpoint has been reached, i.e., the similarities of all model elements stabilize.
In our example, the refined mapping product returned by SFJoin contains
211 node pairs with positive similarities (out of a total of $31 \cdot 55 = 1705$
entries in the $G_1$, $G_2$ cross-product).
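The propagation idea described above can be sketched in a few lines of Python. This is a simplified illustration and not the definition of SFJoin: the edge weights (one divided by the number of equally labeled edges leaving or entering a node pair), the normalization by the largest value, the stopping threshold, and the two tiny input graphs are all assumptions made for the sketch. Graphs are given as sets of (source, label, target) triples.

```python
from collections import defaultdict

def similarity_flooding(edges_a, edges_b, initial=None, max_iter=100, eps=1e-4):
    """Propagate similarities between node pairs of two labeled graphs."""
    # Pair up edges that carry the same label in both graphs.
    pair_edges = [
        ((x, y), p, (x2, y2))
        for (x, p, x2) in edges_a
        for (y, q, y2) in edges_b
        if p == q
    ]
    if not pair_edges:
        return {}
    out_count = defaultdict(int)   # equally labeled edges leaving a node pair
    in_count = defaultdict(int)    # equally labeled edges entering a node pair
    for src, p, dst in pair_edges:
        out_count[(src, p)] += 1
        in_count[(dst, p)] += 1
    pairs = {n for src, _, dst in pair_edges for n in (src, dst)}
    sigma = {n: 1.0 if initial is None else initial.get(n, 0.0) for n in pairs}

    for _ in range(max_iter):
        nxt = dict(sigma)
        for src, p, dst in pair_edges:
            # Similar pairs make their neighbors more similar, in both directions.
            nxt[dst] += sigma[src] / out_count[(src, p)]
            nxt[src] += sigma[dst] / in_count[(dst, p)]
        top = max(nxt.values())
        if top > 0:
            nxt = {n: v / top for n, v in nxt.items()}   # keep values in [0, 1]
        delta = max(abs(nxt[n] - sigma[n]) for n in pairs)
        sigma = nxt
        if delta < eps:            # similarities have (approximately) stabilized
            break
    return sigma

# Two invented graphs with edge labels l1 and l2.
edges_a = {("a", "l1", "a1"), ("a", "l1", "a2"), ("a1", "l2", "a2")}
edges_b = {("b", "l1", "b1"), ("b", "l2", "b2"), ("b2", "l2", "b1")}
sigma = similarity_flooding(edges_a, edges_b)
ranking = sorted(sigma.items(), key=lambda item: -item[1])
```

The `max_iter` cap is there because the sketch does not establish convergence in general; `ranking` lists the node pairs from most to least similar.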

As a last operation in the script, operator SelectThreshold selects a subset 
of node pairs in product that corresponds to the “most plausible” matching 
entries. We discuss this operator in Chap. 8. The complete mapping retur- 
ned by SelectThreshold contains 12 entries and is listed in Table 7.2. For 
readability, we substituted numeric node identifiers by the descriptions of 
the objects they represent. For example, we replaced node identifier &2 by 
[Column: Personnel/Pno].



Table 7.2. The mapping after applying SelectThreshold on result of SFJoin

Sim.   Node in $G_1$                Node in $G_2$
1.0    Column                       Column
0.81   [Table: Personnel]           [Table: Employee]
0.66   ColumnType                   ColumnType
0.44   [ColumnType: int]            [ColumnType: int]
0.43   Table                        Table
0.35   [ColumnType: date]           [ColumnType: date]
0.29   [UniqueKey: perskey]         [PrimaryKey: on EmpNo]
0.28   [Column: Personnel/Dept]     [Column: Department/DeptName]
0.25   [Column: Personnel/Pno]      [Column: Employee/EmpNo]
0.19   UniqueKey                    PrimaryKey
0.18   [Column: Personnel/Pname]    [Column: Employee/EmpName]
0.17   [Column: Personnel/Born]     [Column: Employee/Birthdate]



As we see in Table 7.2, the SF algorithm was able to produce a good
mapping between $S_1$ and $S_2$ without any built-in knowledge about SQL DDL,
merely by using graph structures. For example, table Personnel was matched
to table Employee despite the lack of textual similarity. Notice that the table
still contains correspondences like the one between node Column in $G_1$ and
node Column in $G_2$, which are hardly of use given our goal of matching the
specific tables and columns. We discuss the filtering of match results in more
detail in Chap. 8. The similarity values shown in the table may appear
relatively low. As we will explain, in the presence of multiple match
candidates for a given model element, relative similarities are often more
important than absolute values.



7.2 Similarity Flooding Algorithm 

The internal data model that we use for models and mappings is based on 
directed labeled graphs. Every edge in a graph is represented as a triple 
(s,p, o), where s and o are the source and target nodes of the edge, and the 
middle element p is the label of the edge. For a more formal definition of our 
internal data model please refer to Sect. 2.2.1. In this section, we explain our 
algorithm using a simple example presented in Fig. 7.3. The top left part of 
the figure shows two models A and B that we want to match. 




[Figure: models A and B, their pairwise connectivity graph, the induced
propagation graph, and the fixpoint values for the mapping between A and B:
1.0 (a, b); 0.93 (a2, b1); 0.66 (a1, b2); 0.47 (a1, b1); 0.26 (a1, b);
0.26 (a2, b2)]

Fig. 7.3. Example illustrating the Similarity Flooding algorithm



7.2.1 Similarity Propagation Graph 

A similarity propagation graph is an auxiliary data structure derived from 
models A and B that is used in the fixpoint computation of our algorithm. 
To illustrate how the propagation graph is computed from A and B , we first 
define a pairwise connectivity graph (PCG) as follows:

$$((x, y), p, (x', y')) \in \mathit{PCG}(A, B) \iff (x, p, x') \in A \text{ and } (y, p, y') \in B$$

Each node in the connectivity graph is an element from $A \times B$. We call
such nodes map pairs. The connectivity graph for our example is enclosed in
a dashed frame in Fig. 7.3. The intuition behind arcs that connect map pairs
is the following. Consider for example map pairs $(a, b)$ and $(a_1, b_1)$.
If $a$ is similar to $b$, then probably $a_1$ is somewhat similar to $b_1$.
The evidence for this conclusion is provided by the $l_1$-edges that connect
$a$ to $a_1$ in graph A and $b$ to $b_1$ in graph B. This evidence is captured
in the connectivity graph as an $l_1$-edge leading from $(a, b)$ to
$(a_1, b_1)$. We call $(a_1, b_1)$ and $(a, b)$ neighbors.

The induced propagation graph for A and B is shown next to the connectivity
graph in Fig. 7.3. For every edge in the connectivity graph, the propagation
graph contains an additional edge going in the opposite direction. The
weights placed on the edges of the propagation graph indicate how well the
similarity of a given map pair propagates to its neighbors and back. These
so-called propagation coefficients range from 0 to 1 inclusively and can be
computed in many different ways. The approach illustrated in Fig. 7.3 is
based on the intuition that each edge type makes an equal contribution of 1.0
to spreading of similarities from a given map pair. For example, there is
exactly one $l_2$-edge out of $(a_1, b)$ in the connectivity graph. In such a
case we set the coefficient $w((a_1, b), (a_2, b_2))$ in the propagation graph
to 1.0. The value 1.0 indicates that the similarity of $a_1$ to $b$
contributes fully to that of $a_2$ and $b_2$. Analogously, the propagation
coefficient $w((a_2, b_2), (a_1, b))$ on the reverse edge is also set to 1.0,
since there is exactly one incoming $l_2$-edge for $(a_2, b_2)$. In contrast,
two $l_1$-edges are leaving map pair $(a, b)$ in the connectivity graph.
Thus, the weight of 1.0 is distributed equally among
$w((a, b), (a_1, b_1)) = 0.5$ and $w((a, b), (a_2, b_1)) = 0.5$. In Sect. 7.3
we analyze several alternative ways of computing the propagation coefficients.
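The construction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the edge lists of models A and B are reconstructed from the prose (the figure itself is not reproduced here), and the function names `pairwise_connectivity_graph` and `propagation_graph` are hypothetical, not part of the prototype.

```python
from collections import defaultdict

# Edge triples (source, label, target). ASSUMPTION: these two models are
# reconstructed from the textual description of Fig. 7.3.
A = [("a", "l1", "a1"), ("a", "l1", "a2"), ("a1", "l2", "a2")]
B = [("b", "l1", "b1"), ("b", "l2", "b2"), ("b2", "l2", "b1")]


def pairwise_connectivity_graph(model_a, model_b):
    """Return PCG edges ((x, y), p, (x2, y2)) for equally labeled edge pairs."""
    return [((x, y), p, (x2, y2))
            for (x, p, x2) in model_a
            for (y, q, y2) in model_b
            if p == q]


def propagation_graph(pcg):
    """Add a reverse edge for every PCG edge and weight both directions.

    Each edge label contributes a total weight of 1.0, split equally among
    the equally labeled edges leaving (or entering) a map pair.
    """
    out_count = defaultdict(int)   # (source pair, label) -> outgoing edges
    in_count = defaultdict(int)    # (target pair, label) -> incoming edges
    for src, p, dst in pcg:
        out_count[(src, p)] += 1
        in_count[(dst, p)] += 1
    weights = defaultdict(float)
    for src, p, dst in pcg:
        weights[(src, dst)] += 1.0 / out_count[(src, p)]   # forward edge
        weights[(dst, src)] += 1.0 / in_count[(dst, p)]    # reverse edge
    return dict(weights)


pcg = pairwise_connectivity_graph(A, B)
w = propagation_graph(pcg)
for (src, dst), coeff in sorted(w.items()):
    print(src, "->", dst, coeff)
```

With these inputs the two coefficients leaving (a, b) come out as 0.5 each, and the coefficients between (a1, b) and (a2, b2) as 1.0 in both directions, in line with the values discussed in the text.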

7.2.2 Fixpoint Computation 

Let $\sigma(x, y) \geq 0$ be the similarity measure of nodes $x \in A$ and
$y \in B$ defined as a total function over $A \times B$. We refer to $\sigma$
as a mapping. The similarity flooding algorithm is based on an iterative
computation of $\sigma$-values. Let $\sigma^i$ denote the mapping between A
and B after the $i$-th iteration. Mapping $\sigma^0$ represents the initial
similarity between nodes of A and B, which is typically obtained using string
comparisons of node labels. In our example we assume that no initial mapping
between A and B is available, i.e. $\sigma^0(x, y) = 1.0$ for all
$(x, y) \in A \times B$.

In every iteration, the $\sigma$-values for a map pair $(x, y)$ are
incremented by the $\sigma$-values of its neighbor pairs in the propagation
graph multiplied by the propagation coefficients on the edges going from the
neighbor pairs to $(x, y)$. For example, after the first iteration
$\sigma^1(a_1, b_1) = \sigma^0(a_1, b_1) + \sigma^0(a, b) \cdot 0.5 = 1.5$.
Analogously,
$\sigma^1(a, b) = \sigma^0(a, b) + \sigma^0(a_1, b_1) \cdot 1.0 + \sigma^0(a_2, b_1) \cdot 1.0 = 3.0$.
Then, all values are normalized, i.e., divided by the maximal $\sigma$-value
(of the current iteration) $\sigma^1(a, b) = 3.0$. Thus, after normalization
we get $\sigma^1(a, b) = 1.0$, $\sigma^1(a_1, b_1) = \frac{1.5}{3.0} = 0.5$,
etc. In general, mapping $\sigma^{i+1}$ is computed from mapping $\sigma^i$ as
follows (normalization is omitted for clarity):

$$\sigma^{i+1}(x, y) = \sigma^i(x, y)
  + \sum_{(a_u, p, x) \in A,\ (b_u, p, y) \in B} \sigma^i(a_u, b_u) \cdot w((a_u, b_u), (x, y))
  + \sum_{(x, p, a_v) \in A,\ (y, p, b_v) \in B} \sigma^i(a_v, b_v) \cdot w((a_v, b_v), (x, y))$$

The above computation is performed iteratively until the Euclidean length
of the residual vector $\Delta(\sigma^n, \sigma^{n-1})$ becomes less than
$\epsilon$ for some $n > 0$. If the computation does not converge, we
terminate it after some maximal number of iterations. In Chap. 9, we study
the convergence properties of the algorithm. The right part of Fig. 7.3
displays the similarity values for the map pairs in the propagation graph.
These values have been obtained after five iterations using the above
equation. In the figure, the top three matches with the highest ranks are
highlighted in bold. These map pairs indicate how the nodes in A should be
mapped onto nodes in B.
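The iteration just described can be sketched as follows. The propagation coefficients are written out by hand; those between (a1, b2) and (a2, b1) are an assumption derived by the same equal-split rule, since the figure is not reproduced here. The names `step` and `flood` and the default values of `eps` and `max_iter` are illustrative choices, not the prototype's.

```python
import math

# Propagation coefficients W[(source pair, target pair)] for the example.
# Map pairs are written as plain strings for brevity.
W = {
    ("a,b", "a1,b1"): 0.5, ("a,b", "a2,b1"): 0.5,
    ("a1,b1", "a,b"): 1.0, ("a2,b1", "a,b"): 1.0,
    ("a1,b", "a2,b2"): 1.0, ("a2,b2", "a1,b"): 1.0,
    # ASSUMPTION: derived by the same rule as the coefficients above.
    ("a1,b2", "a2,b1"): 1.0, ("a2,b1", "a1,b2"): 1.0,
}
PAIRS = sorted({pair for edge in W for pair in edge})


def step(sigma):
    """One basic iteration: add weighted neighbor values, then normalize."""
    nxt = dict(sigma)
    for (src, dst), coeff in W.items():
        nxt[dst] += sigma[src] * coeff
    top = max(nxt.values())
    return {pair: value / top for pair, value in nxt.items()}


def flood(sigma0, eps=1e-3, max_iter=100):
    """Iterate until the Euclidean length of the residual drops below eps."""
    sigma = dict(sigma0)
    for n in range(1, max_iter + 1):
        nxt = step(sigma)
        residual = math.sqrt(sum((nxt[p] - sigma[p]) ** 2 for p in PAIRS))
        sigma = nxt
        if residual < eps:
            break
    return sigma, n


sigma0 = {pair: 1.0 for pair in PAIRS}   # no initial mapping available
sigma1 = step(sigma0)                    # sigma1["a1,b1"] is 1.5 / 3.0

sigma5 = sigma0
for _ in range(5):
    sigma5 = step(sigma5)
ranking = sorted(PAIRS, key=lambda p: -sigma5[p])
```

After the first step the sketch reproduces the worked values from the text (1.0 for (a, b) and 0.5 for (a1, b1)). The absolute values after five steps depend on which fixpoint formula is used, so only the ranking of the map pairs should be compared with the figure.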

Taking normalization into account, we can rewrite the above equation to
obtain the “basic” fixpoint formula shown in Table 7.3. The function
$\varphi$ increments the similarities of each map pair based on similarities
of their neighbors in the propagation graph. The variations A, B, and C of the
fixpoint formula are studied in Chap. 9. Our experiments suggest that formula
C performs best with respect to quality of match results and convergence
speed. In the next section we explain how the fixpoint formulas are derived
and present a more general formulation of the flooding algorithm.



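Table 7.3. Variations of the fixpoint formula

Identifier   Fixpoint formula
Basic        $\sigma^{i+1} = \mathrm{normalize}(\sigma^i + \varphi(\sigma^i))$
A            $\sigma^{i+1} = \mathrm{normalize}(\sigma^0 + \varphi(\sigma^i))$
B            $\sigma^{i+1} = \mathrm{normalize}(\varphi(\sigma^0 + \sigma^i))$
C            $\sigma^{i+1} = \mathrm{normalize}(\sigma^0 + \sigma^i + \varphi(\sigma^0 + \sigma^i))$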
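The four formulas of Table 7.3 differ only in how the initial mapping, the current mapping and the propagation function are combined. A minimal sketch, assuming mappings are dictionaries from map pairs to similarity values; the two-pair propagation graph and the helper names `add`, `make_phi` and `FORMULAS` are hypothetical and serve only to make the example runnable:

```python
def normalize(sigma):
    """Divide all values by the highest similarity value."""
    top = max(sigma.values())
    return {pair: value / top for pair, value in sigma.items()}


def add(sigma, nu):
    """Pointwise sum of two mappings over the same map pairs."""
    return {pair: sigma[pair] + nu[pair] for pair in sigma}


def make_phi(weights):
    """Build a propagation function from coefficients (src, dst) -> weight."""
    def phi(sigma):
        theta = {pair: 0.0 for pair in sigma}
        for (src, dst), coeff in weights.items():
            theta[dst] += sigma[src] * coeff
        return theta
    return phi


# One iteration step for each row of Table 7.3: (sigma0, sigma_i, phi) -> sigma_{i+1}
FORMULAS = {
    "basic": lambda s0, s, phi: normalize(add(s, phi(s))),
    "A": lambda s0, s, phi: normalize(add(s0, phi(s))),
    "B": lambda s0, s, phi: normalize(phi(add(s0, s))),
    "C": lambda s0, s, phi: normalize(add(add(s0, s), phi(add(s0, s)))),
}

# Hypothetical propagation graph with two map pairs "p" and "q".
phi = make_phi({("p", "q"): 1.0, ("q", "p"): 0.5})
sigma0 = {"p": 1.0, "q": 0.5}
first_step = {name: f(sigma0, sigma0, phi) for name, f in FORMULAS.items()}
```

On the first step the basic formula and formula A coincide, because the current mapping still equals the initial one; they diverge from the second step onwards.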



7.3 Generalized Version of the Algorithm 

The core of the formal definition of the algorithm is based on the function
$\varphi$ that takes a mapping $\sigma$ as input parameter and produces
mapping $\theta$ as output. For any two given models A and B, $\varphi$ is
defined as follows:

$$\varphi(\sigma) = \theta \iff \forall (a, b) \in A \times B:\ \theta(a, b) =
  \sum_{(a, p, x) \in A,\ (b, q, y) \in B} \sigma(x, y) \cdot \pi_r(\langle x, p, A \rangle, \langle y, q, B \rangle)
  + \sum_{(x, p, a) \in A,\ (y, q, b) \in B} \sigma(x, y) \cdot \pi_l(\langle x, p, A \rangle, \langle y, q, B \rangle)$$

Function $\varphi$ describes how the similarity of the neighbor pairs of
$(a, b)$ “flows” into the similarity of $(a, b)$. Function $\pi$ defines the
propagation coefficients for a map pair $(x, y)$ with respect to $p$-labeled
edges in A and $q$-labeled edges in B. The $\pi$-function that corresponds to
the example described in Sect. 7.2 is based on the inverse product of the
numbers of equilabeled edges in A and B computed for each map pair:

$$\pi_{\{l,r\}}(\langle x, p, A \rangle, \langle y, q, B \rangle) =
  \begin{cases}
    \dfrac{1}{\mathit{card}_{\{l,r\}}(x, p, A) \cdot \mathit{card}_{\{l,r\}}(y, q, B)}, & \text{if } p = q \\
    0, & \text{if } p \neq q
  \end{cases}$$

where $\mathit{card}(x, p, M)$ delivers the number of outgoing or incoming
edges of node $x$ that carry label $p$ in model $M$:

$$\mathit{card}_l(x, p, M) = |\{(x, p, t) \mid \exists t : (x, p, t) \in M\}|$$

$$\mathit{card}_r(x, p, M) = |\{(t, p, x) \mid \exists t : (t, p, x) \in M\}|$$
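The definitions above translate almost literally into code that works on the edge lists of A and B, without building a connectivity graph first. The sketch below assumes the example models of Sect. 7.2 (their edge lists are reconstructed from the prose and are therefore an assumption); the function names mirror the formulas but are otherwise hypothetical.

```python
def card_l(x, p, model):
    """Number of outgoing p-labeled edges of node x in the model."""
    return sum(1 for (s, q, _) in model if s == x and q == p)


def card_r(x, p, model):
    """Number of incoming p-labeled edges of node x in the model."""
    return sum(1 for (_, q, t) in model if t == x and q == p)


def pi(card, x, p, model_a, y, q, model_b):
    """Inverse-product coefficient; card is either card_l or card_r."""
    if p != q:
        return 0.0
    return 1.0 / (card(x, p, model_a) * card(y, q, model_b))


def phi(sigma, model_a, model_b):
    """Compute theta = phi(sigma) over all map pairs in A x B."""
    theta = {pair: 0.0 for pair in sigma}
    for (s, p, t) in model_a:
        for (u, q, v) in model_b:
            if p != q:
                continue  # pi is 0 for differently labeled edges
            # (s, u) receives from its successor pair (t, v) via pi_r
            theta[(s, u)] += sigma[(t, v)] * pi(card_r, t, p, model_a, v, q, model_b)
            # (t, v) receives from its predecessor pair (s, u) via pi_l
            theta[(t, v)] += sigma[(s, u)] * pi(card_l, s, p, model_a, u, q, model_b)
    return theta


# ASSUMPTION: example models reconstructed from the description of Fig. 7.3.
A = [("a", "l1", "a1"), ("a", "l1", "a2"), ("a1", "l2", "a2")]
B = [("b", "l1", "b1"), ("b", "l2", "b2"), ("b2", "l2", "b1")]
sigma = {(x, y): 1.0 for x in ("a", "a1", "a2") for y in ("b", "b1", "b2")}
theta = phi(sigma, A, B)
```

For the uniform initial mapping this yields an increment of 2.0 for (a, b) and 0.5 for (a1, b1), which matches the first-iteration values 3.0 and 1.5 computed in Sect. 7.2.2 before normalization.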

The definitions of functions $\varphi$ and $\pi$ use A and B directly without
relying on the pairwise connectivity graph. This is a more general approach,
since the propagation graph typically contains more information than the
connectivity graph. For example, the propagation coefficients obtained using a
$\pi$-function based on inverse average (described below) cannot be computed
using just the connectivity graph. Finally, in the definition of our algorithm
we rely on summation and normalization of mappings. These two operations are
defined as follows. The sum of mappings $\sigma$ and $\nu$ is a mapping
$\theta$ such that:

$$\forall (x, y) \in A \times B:\ \theta(x, y) = \sigma(x, y) + \nu(x, y)$$

The function normalize projects all similarity values of a mapping into
the range [0, 1]. That is, normalization corresponds to dividing vector
$\sigma$ by a scalar value that represents the highest similarity value in
$\sigma$:

$$\theta = \mathrm{normalize}(\sigma) \iff \forall (a, b) \in A \times B:\
  \theta(a, b) = \frac{\sigma(a, b)}{\max\{s \mid \exists x, y : \sigma(x, y) = s\}}$$



Now we can define the main iteration step of our algorithm. In the version
of the algorithm illustrated in Sect. 7.2, on every iteration, a set of new
similarity values is computed as follows:

$$\sigma^{i+1} = \mathrm{normalize}(\sigma^i + \varphi(\sigma^i))$$

The above computation is performed iteratively until
$\Delta(\sigma^n, \sigma^{n-1})$ satisfies a chosen precision goal for some
$n > 0$. To ensure convergence and efficiency (compare Table 9.3), we use a
variation of the algorithm shown below:

$$\sigma^{i+1} = \mathrm{normalize}(\sigma^0 + \sigma^i + \varphi(\sigma^0 + \sigma^i))$$

The rationale behind this modification is discussed in Sect. 7.4. Our user
study suggests that the faster converging version of the algorithm does not
negatively impact the quality of the results.



7.4 Convergence and Complexity of the Algorithm

The fixpoint computation of the similarity flooding algorithm can be
expressed as the following eigenvector computation. Let T be the square matrix
corresponding to the similarity propagation graph G obtained from models
A and B. If there is an edge going from map pair $j = (x, y)$ to
$i = (x', y')$ with propagation coefficient $c$, then let the matrix entry
$t_{ij}$ have the value $c$. Let all other entries have the value 0. Notice
that the propagation coefficients in G correspond to transition probabilities
if T is a transition matrix.

The fixpoint computation converges when T is an aperiodic, irreducible
matrix (Ergodic theorem). Matrix T is irreducible if and only if the
associated graph G is strongly connected (every node is reachable from every
other node). To ensure these properties, we can introduce self-loops in G
by including the summand $\sigma^0$ in the fixpoint equation, for example as
$\sigma^{i+1} = \mathrm{normalize}(\sigma^0 + \varphi(\sigma^i))$. This
approach is also referred to in the literature as dampening. If $\sigma^0$
assigns a non-zero value to each map pair in $A \times B$, then adding
$\sigma^0$ is equivalent to modifying G into G' in which all nodes are
interconnected with certain propagation coefficients. Let T' be the matrix
associated with G'.

Now the eigenvector computation can be expressed as follows. Let S be
a map pair vector that at every position contains a similarity value from
$\sigma$ for a fixed order of map pairs. One iteration of the fixpoint
computation corresponds to the matrix-vector multiplication $T' \times S$.
Repeatedly multiplying S by T' yields the dominant eigenvector $S^*$ of the
matrix T' such that $T' \times S^* = \lambda S^*$, where $\lambda$ is the
dominant eigenvalue of T'. In the fixpoint equation, normalization
corresponds to dividing $T' \times S^*$ by $\lambda$.
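The correspondence between repeated multiplication with normalization and the dominant eigenvector can be illustrated with a plain power iteration. The 2×2 matrix below is a hypothetical stand-in for T' (it is not derived from any of the examples), and `dominant_eigenpair` is an illustrative name.

```python
def mat_vec(matrix, vector):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def dominant_eigenpair(matrix, start, eps=1e-12, max_iter=1000):
    """Power iteration that normalizes by the largest entry after each step."""
    s = list(start)
    lam = 0.0
    for _ in range(max_iter):
        product = mat_vec(matrix, s)
        lam = max(product)                   # normalization factor
        nxt = [value / lam for value in product]
        converged = max(abs(n - o) for n, o in zip(nxt, s)) < eps
        s = nxt
        if converged:
            break
    return lam, s


# ASSUMPTION: a small non-negative matrix chosen only for illustration.
T_PRIME = [[2.0, 1.0],
           [1.0, 2.0]]
lam, s_star = dominant_eigenpair(T_PRIME, [1.0, 0.0])
```

For this matrix the iteration settles on the vector (1, 1) with a normalization factor of 3, the dominant eigenvalue, so dividing T' × S* by that factor returns S* itself.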

The fixpoint computation corresponds to computing Markov chains over
T. This fact provides an interesting insight into the algorithm. Because T
corresponds to the transition matrix over the graph G, the obtained
similarity measure can be viewed as the stationary probability distribution
over map pairs induced by a random walk from pair to pair. This random walk
corresponds to a manual matching process performed by a human designer
on models A and B. Suppose that only structural information is available to
the designer. Starting with a given map pair, the designer infers the
similarity of another map pair based on the structural properties of A and B.
Consider that A and B are models of relational schemas. If the designer
concludes that table $t_1$ in A matches table $t_2$ in B, then there is a
certain probability that his or her next step will be matching the columns of
$t_1$ to those of $t_2$.

The convergence rate of the fixpoint computation depends on the ratio
between the dominant and the second eigenvalue of T, which are determined
by the structural properties¹ of G'. Higher dampening values contribute to
a faster convergence. For a given precision, using both $\sigma^0$ and
$\sigma^i$ in the variation
$\sigma^{i+1} = \mathrm{normalize}(\sigma^0 + \sigma^i + \varphi(\sigma^0 + \sigma^i))$
of the fixpoint formula improves the convergence speed by up to a factor of 5
without impairing the quality of the result.

¹ The asymptotic rate of convergence coincides with the so-called spectral
radius of the matrix T'.

The convergence of the iterations can be measured using the residual vector
$R_i$, the difference between the normalized product $T' \times S_i$ and
$S_i$. We can treat $|R_i|$ as an indicator for how well $S_i$ approximates
$S^*$. For many practical purposes we are only interested in the resulting
order of map pairs and not in the absolute values of the similarity
coefficients. In such cases, the iterations can be interrupted when the order
in a certain subset of a mapping with the highest similarity values has
stabilized, i.e. does not change from $\sigma^{n-1}$ to $\sigma^n$. In many
practical scenarios, this criterion is already satisfied when $|R_i| < 0.05$.

Let us now turn to the complexity of the algorithm. The number of operations
in every iteration of the fixpoint computation is proportional to the
number of edges in the propagation graph G. This number is in turn
proportional to the product of edge numbers in models A and B. Let $N_A$ and
$N_B$ be the number of nodes in A and B, respectively. If nodes in A and B are
fully interconnected (every node is directly connected to every other node),
the edge numbers in A and B are $O(N_A^2)$ and $O(N_B^2)$. If all these edges
are equilabeled, the number of edges in G is $O(N_A^2 \cdot N_B^2)$. That
means, the worst case complexity of every iteration is
$O(N_A^2 \cdot N_B^2)$, or $O(|A| \cdot |B|)$, where $|A|$ and $|B|$ are the
numbers of edges in A and B. However, in many common scenarios, the average
complexity of every iteration is $O(N_A \cdot N_B)$. For typical relational or
XML schemas the fixpoint computation converges within 5-30 iterations. That
means that the running time of the flooding algorithm is comparable to that of
a nested loop join in relational databases (multiplied with a small factor).

A straightforward implementation of the fixpoint computation requires
two occurrences of $\sigma$-vectors in memory besides $\sigma^0$. The memory
usage is important for very large models that may contain parts of
dictionaries or classification schemas.



7.5 Features of the Algorithm by Example 

In this section we discuss the features and limitations of the similarity
flooding algorithm using four matching problems. In Sect. 7.1 we demonstrated
how the algorithm performs on two sample relational schemas encoded as
directed labeled graphs. Our next example deals with matching of
semistructured data instances. After that, we illustrate matching of XML
schemas. The third example addresses matching XML schemas using XML instance
data. The last example deals with the task of finding related data elements in
a database. The goal of our discussion in this section is to illustrate the
usefulness of the algorithm and the threshold-based filter defined in the
previous section for different application scenarios.

7.5.1 Semistructured Data

Detecting changes by comparing data snapshots is an important task in
difference queries, version and configuration management. Fig. 7.4 shows an
example borrowed from (Chawathe and García-Molina 1997) that illustrates
change detection in two labeled trees. The numbers inside the circles are
node identifiers. The tree $T_2$ on the right has been obtained from the tree
$T_1$ on the left by applying a series of transformation operations. First,
all node identifiers have been replaced. In addition, some subtrees have been
copied and moved, and a new node (60) has been inserted. In this example,
we are interested in finding a best match candidate for every node of $T_2$
(i.e. a mapping between $T_2$ and $T_1$ that satisfies the cardinality
constraint [0, n] - [1, 1]). We can express the matching procedure using the
following script:

1. product = SFJoin(T2, T1);

2. result = SelectLeft(product);




Fig. 7.4. Matching of semistructured data



Since no initial mapping is passed to SFJoin, the initial similarities
between all nodes are set to 1.0. We are using operator SelectLeft instead of
SelectThreshold to ensure that all nodes of $T_2$ are present in the resulting
mapping (we discuss filtering in detail in Chap. 8). For every “left” node of
the mapping, SelectLeft returns the match candidate with the highest
absolute similarity. The result of matching is shown in Table 7.4. The fourth
column in the table describes the transformation operations performed on
the nodes (this information is not part of the resulting mapping and is
provided for illustration only). As the table suggests, the algorithm could
correctly map every node in the modified tree $T_2$ to its previous version in
$T_1$. Notice a heavy drop in similarity for copied, moved and inserted nodes.
This result supports the intuition that exact structural matches should yield
higher similarity values.

The right-most columns of the table show the relative similarities of the
nodes in $T_1$ and $T_2$. For instance, node 62 is the top candidate for node
8, so $\sigma_{rel}(62, 8) = 1$. For node 53, i.e. the second best candidate,
$\sigma_{rel}(53, 8) \approx 0.16$. If instead of SelectLeft we applied
SelectThreshold with any $t_{rel} \in (0.16, 1]$ to the result of SFJoin, we
would get all map pairs for $T_2$-nodes that have been either just renamed or
moved. Lowering $t_{rel}$ to 0.10 causes all copied nodes to appear
additionally in the result. Finally, setting $t_{rel}$ to a value like 0.05
includes the inserted node (but still filters out the rest of the 130 map
pairs returned by SFJoin). This example illustrates that in certain scenarios
undesired results can be pruned quickly by modifying threshold values
interactively.

Table 7.4. The mapping after applying SFJoin ∘ SelectLeft to semistructured
data in Fig. 7.4
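The two selection styles used in this example can be sketched as follows. Several points are assumptions: the similarity values in `product` are invented (only the ratio 0.16 echoes the text), relative similarity is taken to be the absolute similarity divided by the best similarity of the same node, and the threshold filter is assumed to require the relative similarity in both directions to reach the threshold. The actual operators are defined in Chap. 8; the function names here are hypothetical.

```python
def select_left(product):
    """Keep, for every left node, the candidate with the highest similarity."""
    best = {}
    for (left, right), sim in product.items():
        if left not in best or sim > best[left][1]:
            best[left] = (right, sim)
    return {(left, right): sim for left, (right, sim) in best.items()}


def relative(product, side):
    """Relative similarities with respect to the left (0) or right (1) node."""
    top = {}
    for pair, sim in product.items():
        top[pair[side]] = max(top.get(pair[side], 0.0), sim)
    return {pair: sim / top[pair[side]] for pair, sim in product.items()}


def select_relative_threshold(product, t_rel):
    """ASSUMPTION: keep pairs whose relative similarity reaches t_rel on both sides."""
    rel_left, rel_right = relative(product, 0), relative(product, 1)
    return {pair: sim for pair, sim in product.items()
            if rel_left[pair] >= t_rel and rel_right[pair] >= t_rel}


# Hypothetical excerpt of an SFJoin result: (node in T2, node in T1) -> similarity
product = {(62, 8): 0.50, (53, 8): 0.08, (53, 7): 0.40, (62, 7): 0.10}

left_result = select_left(product)
rel_right = relative(product, 1)          # rel_right[(53, 8)] is 0.08 / 0.50
strict = select_relative_threshold(product, 0.5)
```

Lowering `t_rel` admits progressively weaker candidates, which is the interactive pruning behaviour described above.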

7.5.2 XML Schemas 

The next example that we discuss illustrates how our algorithm copes with
different choices of graph-based representation for the models to be matched.
Consider two XML schemas in Fig. 7.5. The schemas are specified using the
XML schema language deployed on the website biztalk.org designed for
electronic documents used in e-business.

    <Schema name="Schema1"
            xmlns="urn:schemas-microsoft-com:xml-data">
      <ElementType name="AccountOwner">
        <element type="Name"/>
        <element type="Address"/>
        <element type="Birthdate"/>
        <element type="TaxExempt"/>
      </ElementType>
      <ElementType name="Address">
        <element type="Street"/>
        <element type="City"/>
        <element type="State"/>
        <element type="ZIP"/>
      </ElementType>
    </Schema>

    <Schema name="Schema2"
            xmlns="urn:schemas-microsoft-com:xml-data">
      <ElementType name="Customer">
        <element type="Cname"/>
        <element type="CAddress"/>
      </ElementType>
      <ElementType name="CustomerAddress">
        <element type="Street"/>
        <element type="City"/>
        <element type="USState"/>
        <element type="PostalCode"/>
      </ElementType>
    </Schema>

Fig. 7.5. Matching of two XML schemas: AccountOwner ($S_1$) vs. Customer ($S_2$)

As in the example of matching relational schemas (Sect. 7.1), both XML
data structures are first converted algorithmically into graphs. Fig. 7.6 shows
portions of two different graph-based representations that are frequently used
for manipulating XML data structures. The XML graph representation on
the left corresponds to that of OEM/Lore (Papakonstantinou et al. 1995),
while the representation on the right is based on the XML/DOM standard. In
the OEM representation, element tags are treated as edge labels, whereas in
the DOM representation hierarchical relationships between elements are
captured using a uniform edge label child.

The result of matching the AccountOwner and Customer schemas is depicted
in Table 7.6. The two left-most columns show the similarity values for
computed map pairs. Omitted values indicate that the corresponding map pair
does not appear in the match result. For readability, we replaced numeric node
identifiers with the descriptions of the objects they represent (in square
brackets).

Fig. 7.6. Two different representations of XML data: OEM/Lore-like vs.
XML/DOM-like



The mapping for the OEM representation was obtained by executing the
script

1. G1 = XML2OEMGraph(S1); G2 = XML2OEMGraph(S2);

2. initialMap = StringMatch(G1, G2);

3. product = SFJoin(G1, G2, initialMap);

4. result = SelectThreshold(product);

For exploiting the DOM representation, the first line is replaced by

1. G1 = XML2DOMGraph(S1); G2 = XML2DOMGraph(S2);

This example illustrates two features of the algorithm. First, the algorithm
produces similar results for different choices of graph-based representation.
Second, the example shows that graph-based representations for models that
use a wider spectrum of edge labels contribute to a faster iterative
computation. The sizes of the graphs in both representations are presented in
Table 7.5. Notice that although the graphs for $S_1$ and $S_2$ have similar
sizes in both representations, the propagation graph in the OEM representation
is roughly 50% smaller than that of the DOM-like representation (128 vs. 267
nodes). Thus, every fixpoint iteration takes less time (we discuss the
complexity of the algorithm in detail in Sect. 7.4). Also note that the only
extra code required for adapting the algorithm for matching XML schemas is the
implementation of the XML2OEMGraph or XML2DOMGraph operator.
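The effect of the edge-label spectrum can be sketched with two toy converters. They are not the XML2OEMGraph and XML2DOMGraph operators of the prototype: the node numbering, the extra `tag` edge in the DOM-like variant and the two small documents are assumptions made for illustration, and the converters work on plain element nesting rather than on the schema language of Fig. 7.5.

```python
import itertools
import xml.etree.ElementTree as ET


def oem_like(root):
    """OEM-like graph: element tags become edge labels."""
    ids = itertools.count(1)
    edges = []

    def visit(elem, parent_id):
        node_id = next(ids)
        edges.append((parent_id, elem.tag, node_id))
        for child in elem:
            visit(child, node_id)

    visit(root, 0)  # node 0 stands for the document itself
    return edges


def dom_like(root):
    """DOM-like graph: uniform 'child' edges plus a 'tag' edge per element."""
    ids = itertools.count(1)
    edges = []

    def visit(elem, parent_id):
        node_id = next(ids)
        edges.append((parent_id, "child", node_id))
        edges.append((node_id, "tag", elem.tag))  # ASSUMPTION: tag kept as literal
        for child in elem:
            visit(child, node_id)

    visit(root, 0)
    return edges


def equilabeled_pairs(edges_a, edges_b):
    """Number of edge pairs with equal labels, i.e. edges of the connectivity graph."""
    return sum(1 for (_, p, _) in edges_a for (_, q, _) in edges_b if p == q)


doc1 = ET.fromstring(
    "<AccountOwner><Name/><Address><Street/><City/></Address></AccountOwner>")
doc2 = ET.fromstring(
    "<Customer><Cname/><CAddress><Street/><City/></CAddress></Customer>")

oem_size = equilabeled_pairs(oem_like(doc1), oem_like(doc2))
dom_size = equilabeled_pairs(dom_like(doc1), dom_like(doc2))
```

Because the OEM-like graphs share only the labels Street and City, far fewer edge pairs qualify than in the DOM-like graphs, where every `child` edge pairs with every other `child` edge. This is the mechanism behind the smaller propagation graph reported for the OEM representation.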



Table 7.5. Parameters of the fixpoint computation for S1 and S2

Representation   Nodes in S1   Nodes in S2   Nodes in propagation graph   Iterations
OEM-like         37            39            128                          7
DOM-like         40            38            267                          6



7.5.3 Matching XML Schemas Using Instance Data 

The two previous examples illustrated matching of instance data and matching
of schema data. The third example that we discuss in this section deals with
yet another matching problem, matching XML schemas using instance data.
Consider two XML instances depicted in Fig. 7.7. Suppose that the XML
tags used in the instances are defined in some schemas (not shown in the
figure) and our goal is to establish the correspondences between the tags.
The data on the left contains information about a Sony camcorder on the
amazon.com website. The data on the right shows similar information from
the yahoo.com website. XML tag names for both schemas were derived from
the actual vocabulary terms used on both sites. For example, the Amazon site
uses the term review, whereas the Yahoo site talks about rating. Notice that
many text pieces in both XML files are different.

Table 7.6. Match results for XML schemas in Fig. 7.5 using two different graph
representations

    <amazon>
      <item>
        <title>Sony DCR-PC100 Digital HandyCam Camcorder</title>
        <listPrice>1899.99</listPrice>
        <ourPrice>1699.00</ourPrice>
        <youSave>200.00</youSave>
        <review>
          <avgReview>4.5</avgReview>
          <numOfReviews>20</numOfReviews>
        </review>
        <availability>On Order; usually ships within 1-2 weeks</availability>
        <features>
          <zoom>10x optical zoom</zoom>
          <zoom>120x digital zoom</zoom>
          <lcd>2.5 inch LCD</lcd>
          <other>4 MB Memory Stick included</other>
        </features>
      </item>
    </amazon>

    <yahoo>
      <productInfo>
        <id>Sony DCR-PC100</id>
        <merchantPrice>1799.94</merchantPrice>
        <rating>
          <userRating>3.5</userRating>
          <userReviews>7</userReviews>
        </rating>
        <description>
          <LCDScreenSize>2.5in</LCDScreenSize>
          <opticalZoom>10 X</opticalZoom>
          <special>4MB Memory Stick</special>
        </description>
      </productInfo>
    </yahoo>

Fig. 7.7. Matching of two XML schemas using instance data in DOM graph
representation

Table 7.7 shows how XML tags used in amazon and yahoo match. This
result was determined by running our algorithm on XML/DOM graphs
corresponding to both data instances. After that, the match candidates that
do not correspond to XML tags were filtered out using a custom operator
XMLMapFilter:

1. G1 = XML2DOMGraph(db1); G2 = XML2DOMGraph(db2);

2. initialMap = StringMatch(G1, G2);

3. product = SFJoin(G1, G2, initialMap);

4. result = XMLMapFilter(product, G1, G2);

Setting the minimal similarity $t_{abs}$ to 0.05 returns a set of
correspondences shown above the horizontal bar in the table. Notice that the
only additional code required for using the algorithm for matching XML schemas
on the basis of instance data was the implementation of operator XMLMapFilter.



Table 7.7. Match results for XML element tags in Fig. 7.7 using similarity
threshold 0.05

Similarity   Tag in db1      Tag in db2
0.27         item            ProductInfo
0.20         amazon          yahoo
0.18         zoom            opticalZoom
0.12         features        description
0.11         ourPrice        merchantPrice
0.11         listPrice       merchantPrice
0.09         title           id
0.08         numOfReviews    userReviews
0.07         other           special
0.06         lcd             LCDScreenSize
0.05         review          userReviews
------------------------------------------
0.04         avgReview       userReviews
0.04         review          rating
0.03         youSave         id
0.03         avgReview       userRating



7.5.4 Finding Related Data 

One last application that we illustrate in this section deals with finding
related data instances. The relatedness information can be computed using the
same instance graph for both inputs of the algorithm. Consider the instance
graph in Fig. 7.8. This graph captures a piece of information about four
faculty members of the Stanford Database Group. The data says that Jennifer
works with Hector on the project WHIPS and that she wrote a textbook
together with Jeff. Table 7.8 shows the relative similarities between the
faculty members. The match result was obtained using the trivial script
result = SFJoin(G, G). “Perfect” match candidates with $\sigma_{rel} = 1$ like
(Gio, Gio) are omitted in the table for brevity. Also, we replaced the
identifiers of the faculty members with their names, e.g. &5 with Jennifer.
Since relative similarity is not symmetric, Jeff is related to Jennifer more
closely than Jennifer is to Jeff.

Fig. 7.8. Excerpt of relationships in the Stanford DB Group



Table 7.8. Relatedness of faculty members in the DB group based on data in
Fig. 7.8

Faculty    Relative similarity (σ_rel)   Faculty
Hector     0.40                          Jennifer
           0.14                          Jeff, Gio
Jeff       0.40                          Jennifer
           0.14                          Hector, Gio
Jennifer   0.32                          Jeff, Hector
           0.11                          Gio
Gio        0.19                          Hector, Jennifer, Jeff



Other applications that we used in our experiments include matching
of ER, UML and RDFS schemas, comparing product catalogs, approximate
queries, matching of service invocations, and matching of mappings. To
summarize, notice that the examples that we discussed in this section differ
quite a lot from each other. They illustrate diverse application scenarios,
the semantics of the nodes in the respective graph representations differs,
and even the matching goals vary. Common to all these examples is, however,
that different matching tasks could be addressed in a uniform fashion using a
very limited amount of custom code. In all scenarios, the similarity flooding
algorithm could be deployed by providing converters into graph representation
for native formats and selecting the desired subsets from the result of SFJoin.
These selection techniques, or filters, are the subject of the next chapter.

The purpose of the examples presented above is to illustrate that the
algorithm is applicable to a broad range of matching problems. Of course,
the examples do not substitute for a comprehensive evaluation. In this thesis,
we focus on schema matching and evaluate our algorithm using several schema
matching problems in Chap. 9. The effectiveness of the algorithm for instance
matching or finding related data remains to be investigated in future work.



“The more alternatives, the more difficult the choice.”
- Abbé d’Allainval (1695-1753)



In this chapter we examine several filters that can be used for choosing the
best match candidates from the list of ranked map pairs returned by the
Similarity Flooding algorithm. Usually, for every element in the matched
models, the algorithm delivers a large set of match candidates. Hence, the
immediate result of the fixpoint computation may still be too voluminous for
many matching tasks. For instance, in a schema matching application the choice
presented to a human user for every schema element may be overwhelming,
even when the presented match candidates are ordered by rank. We refer to
the immediate result of the iterative computation as a multimapping, since it
contains many potentially useful mappings as subsets.

It is not evident which criteria could be useful for selecting a desirable
subset from a multimapping. An additional complication is that as many as
$2^n$ different subsets can be formed from a set of $n$ map pairs. To
illustrate the selection problem, consider the match result obtained for two
tiny models A and B that is shown on the left in Fig. 8.1 (the models
themselves are omitted in the figure for clarity). The multimapping M contains
four map pairs with similarities $\sigma(a_1, b_1) = 1.0$,
$\sigma(a_2, b_1) = 0.54$, etc. From the set of 4 pairs, $2^4 = 16$ distinct
subsets can be selected. Every one of these 16 subsets may be a plausible
alternative for the final match result presented to the user.
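The size of the selection space is easy to verify by enumeration. In the sketch below the four map pairs are assumed to be the full cross-product of {a1, a2} and {b1, b2} (the text names only two of them explicitly), and `is_one_to_one` is a hypothetical helper that checks a [1, 1] - [1, 1] cardinality constraint.

```python
from itertools import combinations

# ASSUMPTION: the multimapping consists of all four pairs over {a1, a2} x {b1, b2}.
PAIRS = [("a1", "b1"), ("a1", "b2"), ("a2", "b1"), ("a2", "b2")]


def all_subsets(pairs):
    """Enumerate every subset of the given map pairs, including the empty one."""
    return [set(chosen)
            for size in range(len(pairs) + 1)
            for chosen in combinations(pairs, size)]


def is_one_to_one(subset, left, right):
    """[1,1] - [1,1]: every left and every right element occurs exactly once."""
    return (sorted(a for a, _ in subset) == sorted(left)
            and sorted(b for _, b in subset) == sorted(right))


subsets = all_subsets(PAIRS)
one_to_one = [s for s in subsets if is_one_to_one(s, ["a1", "a2"], ["b1", "b2"])]
print(len(subsets), len(one_to_one))
```

Even for this tiny multimapping only 2 of the 16 candidate selections satisfy the one-to-one constraint, which hints at how strongly constraints can narrow the choice.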

[Figure: multimapping M (left) and the possible selections under the
[1,1] - [1,1] cardinality constraint (right)]

Fig. 8.1. Cumulative similarity vs. “stable marriage”

Fig. 8.2. Relative similarities for the example in Fig. 8.1



We address the selection problem using a three-step approach. 

— First, as discussed in Sect. 8.1, we use the available application-specific 
constraints to reduce the size of the multimapping. As exemplified below, 
typing and cardinality constraints may help to eliminate many map pairs 
from the multimapping. 

— As a second step, presented in Sect. 8.2, we use selection techniques 
developed in the context of matching in bipartite graphs to pick out the 
subset that is finally delivered to the user. 

— Finally, we evaluate the usefulness of particular selection techniques for 
a given class of matching tasks (e.g., schema matching) and choose the 
technique with the empirically best results. 

In this chapter, we discuss the first two steps in more detail. In Sect. 8.3, 
we present an efficient algorithm for computing one of the best-performing 
filters and discuss its SQL implementation in Sect. 8.4. The evaluation of the 
selection techniques is presented in Chap. 9. 



8.1 Constraints 

Frequently, matching tasks include application-specific constraints that can 
be used for pruning a large portion of the possible selections. Recall our 
relational schemas $S_1$ (Personnel) and $S_2$ (Employee) from Sect. 7.1. At 
least two useful constraints are conceivable for this matching scenario. 
First, we could use a typing constraint to restrict the result to only those 
matches that hold between columns or tables, i.e., we can ignore matches of 
keys, data types etc. Second, if our goal were to populate the Personnel table 
with data from the Employee table, we could deploy a cardinality constraint 
that requires exactly one match candidate for every element of schema $S_1$. 
In this case, the cardinality of the resulting mapping would have to satisfy 
the restriction [0, n] — [1, 1] (using the UML notation). The right expression 
[1, 1] limits the number of $S_2$-elements that may match each element of $S_1$ 
to exactly one (between a lower limit of 1 and an upper limit of 1). 
Conversely, the left expression [0, n] specifies the valid number of 
$S_1$-match candidates (between 0 and n) for each element of $S_2$, i.e., 
elements of $S_2$ may remain unmatched or may have one or more match 
candidates. 

Unfortunately, in many matching tasks typing or cardinality constraints 
do not narrow down the match result sufficiently. To illustrate, consider the 
multimapping in Fig. 8.1. If the definition of the matching task implies a 
cardinality constraint [0,n] — [1,1] (i.e., the mapping is required to contain 
exactly one match candidate for every element in A), 4 of 16 selections 
remain possible. A stricter cardinality constraint [1,1] — [1,1] (i.e., a 
one-to-one mapping) limits our choice to two sets of map pairs $M_1$ and $M_2$ 
shown on the right in Fig. 8.1. Even after applying tight constraints in this 
simple matching task we are still left with more than one choice. Below we 
examine several strategies for making the decision between the remaining 
alternatives $M_i$. 
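
The counts quoted above (16, 4, and 2 selections) can be checked by brute 
force. The following is only a small sketch: the helper names `all_subsets` 
and `satisfies` and the encoding of a cardinality expression as a 
(lower, upper) tuple are assumptions made for this illustration and are not 
part of the testbed. 

```python
from itertools import combinations

# Map pairs of the multimapping M in Fig. 8.1 (similarity values are not needed here).
M = [("a1", "b1"), ("a1", "b2"), ("a2", "b1"), ("a2", "b2")]
A = {"a1", "a2"}
B = {"b1", "b2"}


def all_subsets(pairs):
    """Yield every one of the 2^n subsets of a list of n map pairs."""
    for k in range(len(pairs) + 1):
        for combo in combinations(pairs, k):
            yield frozenset(combo)


def satisfies(selection, left_expr, right_expr):
    """Check a cardinality constraint written as left_expr -- right_expr.

    Following Sect. 8.1, right_expr bounds the number of B-partners of each
    element of A, and left_expr bounds the number of A-partners of each
    element of B. Each expression is a (lower, upper) tuple.
    """
    for a in A:
        partners = sum(1 for (x, _) in selection if x == a)
        if not right_expr[0] <= partners <= right_expr[1]:
            return False
    for b in B:
        partners = sum(1 for (_, y) in selection if y == b)
        if not left_expr[0] <= partners <= left_expr[1]:
            return False
    return True


n = len(M)
selections = list(all_subsets(M))
one_per_a = [s for s in selections if satisfies(s, (0, n), (1, 1))]
one_to_one = [s for s in selections if satisfies(s, (1, 1), (1, 1))]

print(len(selections), len(one_per_a), len(one_to_one))  # 16 4 2
```

Enumerating all subsets is feasible only for tiny examples such as this one, 
since the number of subsets doubles with every additional map pair. 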



8.2 Selection Metrics 

To make an educated choice between the $M_i$'s we need an intuition of what 
constitutes a “better” mapping. Fortunately, our selection dilemma is closely 
related to well-known matching problems in bipartite graphs, so that we 
can build on intuitions and algorithms developed for solving this class of 
problems (see e.g. (Lovasz and Plummer 1986; Gusfield and Irving 1989)). 
In the graph matching literature, a matching is defined as a mapping with 
cardinality [0,1] — [0,1], i.e., a set of edges no two of which are incident on 
the same node. A bipartite graph is one whose nodes form two disjoint parts 
such that no edge connects any two nodes in the same part. Thus, a mapping 
can be viewed as an undirected weighted bipartite graph. 

A helpful intuition that we will predominantly use for explaining 
alternative selection strategies for multimappings is provided by the 
so-called stable marriage problem. Recall that in an instance of the stable 
marriage problem, each of n women and n men lists the members of the opposite 
sex in order of preference. The goal is to find the best match between men 
and women. A stable marriage is defined as a complete matching of men and 
women with the property that there are no two couples $(x, y)$ and $(x', y')$ 
such that $x$ prefers $y'$ to $y$ and $y'$ prefers $x$ to $x'$. For obvious 
reasons, such a situation would be regarded as unstable. Imagine that in 
Fig. 8.1 elements $a_1$ and $a_2$ correspond to women. Then, men $b_1$ and 
$b_2$ would be the primary and the secondary choice for woman $a_1$. Mapping 
$M_1$ satisfies the stable marriage condition, whereas $M_2$ does not. In 
$M_2$, woman $a_1$ and man $b_1$ favor each other over their actual partners, 
which puts their marriages in jeopardy. 
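
The stability condition can be tested mechanically. The sketch below assumes 
that preferences are derived from the absolute similarities of Fig. 8.1 
(higher similarity means higher preference) and that the matching is 
complete; the function name `is_stable` is hypothetical. 

```python
# Absolute similarities of the multimapping in Fig. 8.1.
sigma = {
    ("a1", "b1"): 1.0, ("a1", "b2"): 0.81,
    ("a2", "b1"): 0.54, ("a2", "b2"): 0.27,
}


def is_stable(matching, sigma):
    """Return True if the complete matching contains no blocking pair.

    A pair (x, y2) outside the matching is blocking if x prefers y2 to her
    partner and y2 prefers x to his partner. Preferences are assumed to be
    given by the similarity values in sigma.
    """
    partner_of_woman = {x: y for (x, y) in matching}
    partner_of_man = {y: x for (x, y) in matching}
    for (x, y2), s in sigma.items():
        if (x, y2) in matching:
            continue
        y = partner_of_woman[x]      # current partner of woman x
        x2 = partner_of_man[y2]      # current partner of man y2
        if s > sigma[(x, y)] and s > sigma[(x2, y2)]:
            return False
    return True


M1 = {("a1", "b1"), ("a2", "b2")}
M2 = {("a1", "b2"), ("a2", "b1")}
print(is_stable(M1, sigma), is_stable(M2, sigma))  # True False
```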

The stable-marriage property provides a plausible criterion for selecting a 
desired mapping from a multimapping. Further candidates for desired map- 
pings can be drawn from the following selection criteria and well-known mat- 
ching problems: 

— The assignment problem consists in finding a matching $M_i$ in a weighted 
bipartite graph M that maximizes the total weight (cumulative similarity) 
$\sum_{(x,y) \in M_i} \sigma(x, y)$. Viewed as a marriage, such a matching 
maximizes the total satisfaction of all men and women. In Fig. 8.1, 
$\sum_{M_2} \sigma = 0.81 + 0.54 = 1.35$, whereas 
$\sum_{M_1} \sigma = 1.0 + 0.27 = 1.27$. Thus, $M_2$ maximizes the total 
satisfaction of all men and women even though $M_2$ is not a stable marriage. 

— Another group of selection candidates are maximal, maximum and perfect 
matchings. A maximal matching is a matching that is not properly contained 
in any other matching. A maximum matching is a matching of maximum 
cardinality, i.e., with the largest number of married couples. A perfect (or 
complete) matching is one containing an edge incident on every node, i.e., 
the one in which every man and woman is married. Obviously, a perfect 
matching is achievable only if both parts of a mapping contain the same 
number of elements. $M_1$ and $M_2$ in Fig. 8.1 are maximal, maximum and 
perfect matchings. All of the above-mentioned matching problems produce 
[0, 1] — [0, 1] mappings, i.e., monogamous marriages, and can be solved using 
polynomial-time algorithms (Lovasz and Plummer 1986; Motwani and Raghavan 
1995). 

— Under polygamy , multiple matching counterparts for every element are al- 
lowed. Polygamy is useful for matching tasks in which many-to-many map- 
pings are desirable. In schema matching, for instance, an element of one 
schema may have multiple counterparts in another schema. A polygamous 
variant of perfect matching corresponds to an outer match , i.e., a minimal 
mapping in which every element in both models has at least one counter- 
part. When multiple partners are allowed, the number of candidates for 
every element can be used as an additional factor for selecting the desired 
subset. For example, we may favor a subset $M_i$ of the multimapping that 
maximizes function 
$\sum_{(x,y) \in M_i} \frac{\sigma(x, y)}{|(x,?)| \cdot |(?,y)|}$, analogously 
to the optimum function used in the assignment problem. Terms $|(x,?)|$ and 
$|(?,y)|$ denote the number of partners for woman $x$ and man $y$ in $M_i$. 

— The flooding algorithm produces at most one similarity value for any map 
pair $(x, y)$. We call this value absolute similarity. Absolute similarity is 
symmetric, i.e., x is similar to y exactly as y to x. Under the marriage 
interpretation, this means that any two prospective partners like each other 
to the same extent. Considering relative similarities suggests a more 
diversified interpretation. Relative similarities are asymmetric and are 
computed as fractions of the absolute similarities of the best match 
candidates for any given element. In the example in Fig. 8.1, $b_1$ is the 
best match candidate for $a_2$, so we set 
$\overrightarrow{\sigma}_{rel}(a_2, b_1) := 1.0$. The relative similarity for 
all other match candidates of $a_2$ is computed as a fraction of 
$\sigma(a_2, b_1)$. Thus, 
$\overrightarrow{\sigma}_{rel}(a_2, b_2) := \frac{0.27}{0.54} = 0.5$. All 
relative similarities for this example are summarized in Fig. 8.2. A 
multimapping based on relative similarities corresponds to a directed 
weighted bipartite graph. The previously mentioned selection strategies can 
be adapted to relative similarities in a straightforward way. 

— Some matching tasks require finding a connected subgraph in the target 
model that best matches the one in the source model. In such a case, the 
number of edit operations needed to transform one subgraph into another 
may be included in the selection metric. 

— Similarity thresholds are the last criteria that we discuss. For a given 
absolute-similarity threshold $t_{abs}$ we select a subset of a multimapping 
in which all map pairs carry an absolute similarity value of at least 
$t_{abs}$. For example, for $t_{abs} = 0.5$, Fig. 8.1 suggests that woman 
$a_2$ finds man $b_1$ acceptable (0.54), and would rather not go out with man 
$b_2$ at all (0.27). The relative-similarity threshold $t_{rel}$ is used 
analogously. In the same example, for a relative-similarity threshold 
$t_{rel} = 0.5$ woman $a_2$ would still accept man $b_2$ as a partner, but 
man $b_2$ would reject woman $a_2$ since 
$\overleftarrow{\sigma}_{rel}(a_2, b_2) = \frac{0.27}{0.81} = 0.33 < 0.5$. 
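
The relative similarities and the two kinds of thresholds from the list above 
can be sketched as follows. The function name `relative_similarities` and the 
dictionary names `rel_left` and `rel_right` (relative similarity from the 
point of view of the left and of the right element, respectively) are 
assumptions of this illustration; all similarities are assumed to be 
positive. 

```python
# Absolute similarities of the multimapping in Fig. 8.1.
sigma = {
    ("a1", "b1"): 1.0, ("a1", "b2"): 0.81,
    ("a2", "b1"): 0.54, ("a2", "b2"): 0.27,
}


def relative_similarities(sigma):
    """Return (rel_left, rel_right), both keyed by map pair (x, y).

    rel_left[(x, y)]  = sigma(x, y) / best similarity among candidates of x
    rel_right[(x, y)] = sigma(x, y) / best similarity among candidates of y
    """
    best_left, best_right = {}, {}
    for (x, y), s in sigma.items():
        best_left[x] = max(best_left.get(x, 0.0), s)
        best_right[y] = max(best_right.get(y, 0.0), s)
    rel_left = {(x, y): s / best_left[x] for (x, y), s in sigma.items()}
    rel_right = {(x, y): s / best_right[y] for (x, y), s in sigma.items()}
    return rel_left, rel_right


rel_left, rel_right = relative_similarities(sigma)
print(rel_left[("a2", "b2")])              # a2's view of b2: 0.27 / 0.54
print(round(rel_right[("a2", "b2")], 2))   # b2's view of a2: 0.27 / 0.81

# Absolute-similarity threshold: keep pairs with sigma >= t_abs.
t_abs = 0.5
above_abs = {pair for pair, s in sigma.items() if s >= t_abs}
print(sorted(above_abs))
```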

To summarize, the filtering problem can be characterized by providing a 
set of constraints and a selection function that picks out the “best” subset of 
the multimapping under a given selection metric. Conceptually, the selection 
function assigns a value to every subset of the multimapping. The subset 
for which the function takes the largest/smallest value is selected as the 
final result. For example, using the assignment problem as selection metric, 
we can construct a filter that applies a cardinality constraint 
[0, 1] — [0, 1] and utilizes a selection function 
$\sum_{(x,y) \in M_i} \sigma(x, y)$ to choose the best subset. Some selection 
metrics (e.g., threshold-based ones) can be described in terms of a boolean 
selection function that assigns the value 1 to one subset of the 
multimapping, and 0 to all others. In concrete implementations of selection 
functions, we can often find algorithms that avoid enumerating all subsets of 
the multimapping and determine the desired subset directly. 
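
To make the conceptual view concrete, the filter described above (a 
one-to-one-at-most constraint plus cumulative similarity as selection 
function) can be written as a naive enumeration. This is a sketch for the 
four pairs of Fig. 8.1 only; the names `best_subset`, `is_matching`, and 
`cumulative_similarity` are hypothetical, and a practical implementation 
would use a polynomial-time assignment algorithm instead of enumerating all 
subsets. 

```python
from itertools import combinations

# Absolute similarities of the multimapping in Fig. 8.1.
sigma = {
    ("a1", "b1"): 1.0, ("a1", "b2"): 0.81,
    ("a2", "b1"): 0.54, ("a2", "b2"): 0.27,
}


def is_matching(selection):
    """Constraint: no node takes part in more than one selected pair."""
    lefts = [x for (x, _) in selection]
    rights = [y for (_, y) in selection]
    return len(set(lefts)) == len(lefts) and len(set(rights)) == len(rights)


def cumulative_similarity(selection):
    """Selection function of the assignment problem: sum of similarities."""
    return sum(sigma[pair] for pair in selection)


def best_subset(sigma, constraint, selection_function):
    """Return the admissible subset with the largest selection-function value."""
    pairs = sorted(sigma)
    candidates = (
        frozenset(combo)
        for k in range(len(pairs) + 1)
        for combo in combinations(pairs, k)
    )
    admissible = [s for s in candidates if constraint(s)]
    return max(admissible, key=selection_function)


best = best_subset(sigma, is_matching, cumulative_similarity)
print(sorted(best))  # [('a1', 'b2'), ('a2', 'b1')], i.e. M2 with total 1.35
```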

In the remainder of this section we describe a filter that produced 
empirically best results in a variety of schema matching tasks, as we show 
later in Chap. 9. This approach is implemented in our testbed as the 
SelectThreshold operator. The intuition behind this approach is based on a 
perfectionist egalitarian polygamy, which means that no male or female is 
willing to accept any partner(s) but the best. This criterion corresponds to 
using the relative-similarity threshold $t_{rel} = 1.0$. 

The SelectThreshold operator selects a subset of the multimapping which is 
guaranteed to satisfy the stable-marriage property. However, this selection 
strategy sacrifices the happiness of those individuals who are not number 
one on the list of at least one person of the opposite sex. Such individuals 
are left unmarried, i.e., excluded from the mapping. Most of the time, 
SelectThreshold with $t_{rel} = 1.0$ yields matchings, or monogamous 
societies. In a less picky version of the operator with $t_{rel} < 1.0$, more 
persons have a chance to find a partner, and polygamy is more likely. In the 
examples presented in the following section we demonstrate the impact of the 
threshold value $t_{rel}$ in several practical scenarios. 
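
A threshold filter in the spirit of SelectThreshold can be sketched in a few 
lines. The function below is an illustration, not the testbed operator: the 
name `select_threshold` is hypothetical, and it assumes that a pair is kept 
only if its relative similarity reaches the threshold from the point of view 
of both of its elements. 

```python
# Absolute similarities of the multimapping in Fig. 8.1.
sigma = {
    ("a1", "b1"): 1.0, ("a1", "b2"): 0.81,
    ("a2", "b1"): 0.54, ("a2", "b2"): 0.27,
}


def select_threshold(sigma, t_rel=1.0):
    """Keep pairs whose relative similarity is >= t_rel in both directions."""
    best_left, best_right = {}, {}
    for (x, y), s in sigma.items():
        best_left[x] = max(best_left.get(x, 0.0), s)
        best_right[y] = max(best_right.get(y, 0.0), s)
    # s / best >= t_rel is rewritten as s >= t_rel * best to avoid division.
    return {
        (x, y)
        for (x, y), s in sigma.items()
        if s >= t_rel * best_left[x] and s >= t_rel * best_right[y]
    }


print(sorted(select_threshold(sigma, 1.0)))  # [('a1', 'b1')]
print(sorted(select_threshold(sigma, 0.6)))  # [('a1', 'b1'), ('a1', 'b2')]
```

With the threshold 1.0 only the mutually best pair survives, whereas the 
lower threshold 0.6 already gives element a1 two partners. 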



8.3 FilterBest Algorithm 

For large graphs, the immediate result produced by the flooding algorithm 
can be very large. For example, given two graphs with 5,000 equilabeled edges 
each, the resulting similarity vector contains 25,000,000 elements. Therefore, 
filtering needs to be done efficiently. In this section, we discuss the efficient 
implementation of the SelectThreshold filter. 

A straightforward approach is to sort all pairs in the order of decreasing 
similarity and extract the best matching candidates in a single pass over the 
sorted list of pairs. However, sorting has $O(n \log n)$ complexity. 
Fortunately, there is a simple algorithm that we call FilterBest, which 
extracts the desired subset of the mapping in $O(n)$ time. 

The algorithm FilterBest is presented below. It takes as input morphism 
map represented as a list of triples $(l, r, \sigma)$ where $l$ and $r$ denote 
the nodes in the matched graphs and $\sigma$ is the computed absolute 
similarity value. The algorithm returns as output a morphism that satisfies 
the stable-marriage property, i.e., no candidate match is included for a 
given node if a better candidate is available. Notice that the algorithm 
FilterBest does not produce a matching in the sense used in (Lovasz and 
Plummer 1986), since some nodes may remain unmatched or have multiple top 
match candidates. 

Algorithm FilterBest(map) 
    // map is represented as an array of triples (left, right, sim) 
    cmap := empty hash table; 
    // cmap maps each node to a linked list of positions 
    // of candidate pairs of equal similarity 
    rejected := boolean array of size length(map), initially all false; 
    for i := 1 to length(map) do 
        ProbeCandidate(map[i].left, map[i].right, i); 
    end for 
    clear cmap hash table; 
    for i := 1 to length(map) do 
        ProbeCandidate(map[i].right, map[i].left, i); 
    end for 
    result := empty list of pairs; 
    for i := 1 to length(map) do 
        if not rejected[i] then add pair map[i] to result; 
    end for 
    return result; 

Procedure ProbeCandidate(node, candidate, i) 
    // uses hash table cmap and array rejected from above 
    clist := retrieve linked list of candidates for node from cmap; 
    if clist is empty then 
        append position i to clist; 
    else if candidate is more similar to node than those in clist then 
        for each j in clist do rejected[j] := true; end for 
        // position i becomes the only top candidate 
        clear clist; append position i to clist; 
    else if candidate is less similar then 
        rejected[i] := true; 
    else // candidate is equally similar 
        append position i to clist; 
    end if 
    return; 

FilterBest computes the result using two passes over the input morphism 
map, and two auxiliary data structures: a hash table (cmap) and a boolean 
array (rejected). The first pass makes sure that the stable-marriage property 
is satisfied for the left nodes of map, while the second pass verifies the 
right nodes of map. During each pass, the top seen candidates are stored in 
the hash table cmap, which associates each left/right node with a linked list 
of top 
right/left candidate nodes. All candidates contained in each linked list have 
identical (absolute) similarity. Once a better candidate has been spotted, the 
currently best candidates are marked as rejected. The rejected candidates 
will not appear in the final result. 

To maintain the rejected candidates efficiently, a single boolean array 
rejected is used in both passes. Instead of keeping track of lists of rejected 
candidates, we can simply mark the pairs of the input morphism map as 
rejected, since whenever a is not a top match candidate for b, b cannot be a 
match candidate for a due to the perfectionist egalitarian polygamy principle 
explained above. Since all pairs are passed as a list, they can be marked using 
a boolean array of the same length as map. A given pair may be rejected in 
either pass, or in both passes. 

After both passes have been done, the resulting morphism can be obtained 
as a list of non-rejected pairs. The left and right nodes of each pair are 
guaranteed to be mutually best candidates, otherwise the pair would have been 
rejected. It is easy to see that the algorithm has the asymptotic complexity 
of $O(n)$ in the number of pairs of the input morphism. To see that, notice 
that the procedure ProbeCandidate is called 2n times. Appending the 
candidates in ProbeCandidate corresponds to pushing elements on a stack, 
whereas the internal for-loop can be seen as a series of pop operations. 
Since each pair can be “pushed” and “popped” at most once, the total number 
of these operations is bounded by n + n = 2n in each pass. The last for-loop 
of FilterBest does another n operations. That is, in the worst case, the 
algorithm performs 2n + 2n + 2n + n = 7n steps. 
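
The two-pass procedure described above can be transcribed into a short 
runnable sketch. It is not the testbed implementation: the name `filter_best` 
is hypothetical, Python lists stand in for the linked lists, a dictionary 
stands in for the hash table, and positions are 0-based. 

```python
def filter_best(pairs):
    """Two-pass FilterBest over a list of (left, right, sim) triples.

    Returns the triples that were rejected in neither pass, in input order.
    """
    rejected = [False] * len(pairs)

    def probe(cmap, node, i):
        clist = cmap.setdefault(node, [])   # positions of current top candidates
        if not clist:
            clist.append(i)
            return
        best = pairs[clist[0]][2]           # all entries of clist share this value
        sim = pairs[i][2]
        if sim > best:
            for j in clist:                 # better candidate: reject the old ones
                rejected[j] = True
            clist.clear()
            clist.append(i)
        elif sim < best:
            rejected[i] = True
        else:
            clist.append(i)                 # equally similar candidate

    cmap = {}
    for i, (left, _, _) in enumerate(pairs):    # first pass: left nodes
        probe(cmap, left, i)
    cmap = {}                                   # clear the hash table
    for i, (_, right, _) in enumerate(pairs):   # second pass: right nodes
        probe(cmap, right, i)
    return [p for i, p in enumerate(pairs) if not rejected[i]]


fig_8_1 = [("a1", "b1", 1.0), ("a1", "b2", 0.81),
           ("a2", "b1", 0.54), ("a2", "b2", 0.27)]
print(filter_best(fig_8_1))  # [('a1', 'b1', 1.0)]
```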

If the set of left elements of map is disjoint from the set of right 
elements, both probing for-loops of the FilterBest algorithm can be merged 
into one as follows: 

    for i := 1 to length(map) do 
        ProbeCandidate(map[i].left, map[i].right, i); 
        ProbeCandidate(map[i].right, map[i].left, i); 
    end for 

This modification allows saving one iteration over the map pairs (yielding 
6n steps in the worst case), but increases the amount of memory used 
temporarily, since the auxiliary hash table cmap has to keep both the 
candidates for the left nodes and the candidates for the right nodes. 
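
The merged variant can be sketched in the same style. As in the text, the 
sketch assumes without checking that no node occurs both as a left and as a 
right element, because both kinds of nodes share one hash table; the name 
`filter_best_merged` is hypothetical. 

```python
def filter_best_merged(pairs):
    """Single-loop FilterBest variant; left and right node sets must be disjoint."""
    rejected = [False] * len(pairs)
    cmap = {}  # one table holds the candidates of left AND right nodes

    def probe(node, i):
        clist = cmap.setdefault(node, [])
        if not clist or pairs[i][2] == pairs[clist[0]][2]:
            clist.append(i)          # first or equally similar candidate
        elif pairs[i][2] > pairs[clist[0]][2]:
            for j in clist:          # better candidate: reject the old ones
                rejected[j] = True
            clist[:] = [i]
        else:
            rejected[i] = True       # worse candidate

    for i, (left, right, _) in enumerate(pairs):
        probe(left, i)
        probe(right, i)
    return [p for i, p in enumerate(pairs) if not rejected[i]]


fig_8_1 = [("a1", "b1", 1.0), ("a1", "b2", 0.81),
           ("a2", "b1", 0.54), ("a2", "b2", 0.27)]
print(filter_best_merged(fig_8_1))  # [('a1', 'b1', 1.0)]
```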



8.4 Expressing FilterBest in SQL 

If the input morphism map is stored as a single relational table with the 
attributes left, right, and sim, the filtering can be expressed using a nested 
SQL query shown below: 

SELECT map.left, map.right 
FROM map, 
     (SELECT left AS L, max(sim) AS M 
      FROM map 
      GROUP BY left) AS T1, 
     (SELECT right AS R, max(sim) AS M 
      FROM map 
      GROUP BY right) AS T2 
WHERE map.left = T1.L AND map.right = T2.R AND 
      map.sim = T1.M AND map.sim = T2.M 

To understand why this query works, consider a small example depicted in 
Fig. 8.3. The example shows the input morphism map and two intermediate 
tables, T1 and T2, which correspond to the nested SELECT clauses shown 
above. The table T1 defined in the first clause uses a group-by statement to 
extract the maximal similarity values for each left element of the morphism 
map. The second nested SELECT clause yields the table T2 which associates 
each right element of map with its maximal similarity value. In the example, 
T1 and T2 have 2 and 3 rows, respectively, according to the domain and range 
sizes of map. 



Fig. 8.3. Example illustrating execution of FilterBest in SQL 



In the top SELECT statement, the tables map, T1, and T2 are joined to 
obtain the final result. The important portion is the WHERE clause, which 
ensures that each pair in map appears in the result only if its similarity 
value is the maximal similarity value for both the left and right element. In 
Fig. 8.3, the first row of map does not join with the first row of T1, because 
the similarity value map.sim is unequal to the value of T1.M (0.1 < 0.2). 
Therefore, the pair (a1, b1) does not appear in the result, because there must 
be a better candidate for a1 than b1, namely b2. Tuples (a2, b2), (a2, b3) are 
produced as the result of the query. 
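
The query can be tried out with an in-memory SQLite database. The sketch 
below makes two assumptions: the columns are named lnode and rnode instead of 
left and right, because LEFT and RIGHT are reserved words in many SQL 
dialects, and the table contents are hypothetical values chosen to be 
consistent with the description of Fig. 8.3 (only the similarities 0.1 and 
0.2 are stated in the text). 

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE map (lnode TEXT, rnode TEXT, sim REAL)")
conn.executemany(
    "INSERT INTO map VALUES (?, ?, ?)",
    [  # hypothetical data in the spirit of Fig. 8.3
        ("a1", "b1", 0.1),
        ("a1", "b2", 0.2),
        ("a2", "b2", 0.3),
        ("a2", "b3", 0.3),
    ],
)

rows = conn.execute(
    """
    SELECT map.lnode, map.rnode
    FROM map,
         (SELECT lnode AS L, MAX(sim) AS M FROM map GROUP BY lnode) AS T1,
         (SELECT rnode AS R, MAX(sim) AS M FROM map GROUP BY rnode) AS T2
    WHERE map.lnode = T1.L AND map.rnode = T2.R
      AND map.sim = T1.M AND map.sim = T2.M
    ORDER BY map.lnode, map.rnode
    """
).fetchall()

print(rows)  # [('a2', 'b2'), ('a2', 'b3')]
conn.close()
```

With this data T1 has two rows and T2 has three rows, and only the two pairs 
of a2 survive the join, matching the outcome described above. 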

The declarative specification of the SQL query shown above can be executed 
very efficiently by the query optimizer. In fact, it suggests an alternative 
linear-time filtering algorithm, which yields a result identical to that of 
the FilterBest algorithm of Sect. 8.3. First, the maximal similarity values 
are computed in a single pass over map using two hash tables, which play the 
role of tables T1 and T2. After that, a single pass over map is done, during 
which a lookup in T1 and T2 is performed for each pair of map. This 
alternative in-memory algorithm has performance characteristics similar to 
those of the FilterBest algorithm. 
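
The alternative algorithm outlined in this paragraph amounts to the following 
sketch, in which two dictionaries play the role of T1 and T2. The function 
name `filter_best_two_tables` and the sample data are hypothetical. 

```python
def filter_best_two_tables(pairs):
    """Linear-time filter over (left, right, sim) triples using two hash tables."""
    t1, t2 = {}, {}  # maximal similarity per left node / per right node
    for left, right, sim in pairs:          # first pass: compute the maxima
        if sim > t1.get(left, float("-inf")):
            t1[left] = sim
        if sim > t2.get(right, float("-inf")):
            t2[right] = sim
    # Second pass: keep pairs that are maximal for both of their nodes.
    return [(l, r, s) for (l, r, s) in pairs if s == t1[l] and s == t2[r]]


sample = [("a1", "b1", 0.1), ("a1", "b2", 0.2),
          ("a2", "b2", 0.3), ("a2", "b3", 0.3)]
print(filter_best_two_tables(sample))  # [('a2', 'b2', 0.3), ('a2', 'b3', 0.3)]
```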

The SQL specification presented in this section supports filtering the 
match results backed by the secondary storage. Thus, even very large match 
results can be filtered efficiently using a database system. The SQL approach 
could be particularly useful if the Similarity Flooding algorithm is imple- 
mented using a set of SQL statements and the results already reside in a 
database. 



“Evaluation is creation: hear it, you creators! Evaluating is itself 
the most valuable treasure of all that we value. It is only through eva- 
luation that value exists: and without evaluation the nut of existence 
would be hollow. Hear it, you creators!” 

- Friedrich Nietzsche (1844-1900) 



In this chapter, we suggest an accuracy metric for evaluating automatic 
schema matching algorithms and evaluate the effectiveness of the SF algo- 
rithm on the basis of a user study that we conducted. 

A crucial issue in evaluating matching algorithms is that a precise defi- 
nition of the desired match result is often impossible. In many applications 
the goals of matching depend heavily on the intention of the users, much like 
the users of an information retrieval system have varying intentions when 
doing a search. Typically, a user of an information retrieval system is looking 
for a good, but not necessarily perfect search result, which is generally not 
known. In contrast, a user performing, say, schema matching is often able to 
determine the perfect match result for a given match problem. Moreover, the 
user is willing to adjust the result manually until the intended match has 
been established. Thus, we feel that the quality metrics for matching tasks 
that require tight human quality assessment need to have a slightly different 
focus than those developed in information retrieval. 

The quality metric that we suggest below is based upon the user effort needed 
to transform a match result obtained automatically into the intended result. 
We assume a strict notion of matching quality, i.e., being close is not good 
enough. For example, imagine that a matching algorithm comes up with five 
equally plausible match candidates for a given element, then decides to return 
only two of them, and misses the intended candidate(s). In such a case, we 
give the algorithm zero points despite the fact that the two returned 
candidates might be very similar to what we are looking for. Moreover, our 
metric does not address iterative matching, in which the user repeatedly 
adjusts the result and invokes the matching procedure. Thus, the accuracy 
results we obtain here can be considered “pessimistic”, i.e., our matching 
algorithm may be “more useful” than what our metric predicts. 

This chapter is structured as follows: 

— In Sect. 9.1, we define the metric that we use for evaluating the matching 
quality, called matching accuracy. 

— In Sect. 9.2, we argue that the intended match result needs to be known 
precisely to evaluate the matching quality. 

— The user study in which we gathered intended match results for several 
tasks is presented in Sect. 9.3. 

— The evaluation of the SF algorithm and filters is described in Sect. 9.4. 

— In Sect. 9.5, we study the impact of different ways of computing propaga- 
tion coefficients on overall matching accuracy in the user study. 

We conclude the chapter, and Part III, in Sect. 9.6. 



9.1 Matching Accuracy 

Our goal is to estimate how much effort it costs the user to modify the 
proposed match result $P = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ into the 
intended result $I = \{(a_1, b_1), \ldots, (a_m, b_m)\}$. The user effort can 
be measured in terms of additions and deletions of map pairs performed on the 
proposed match result $P$. One simplified metric that can be used for this 
purpose is what we call match accuracy. Let $c = |P \cap I|$ be the number of 
correct suggestions. The difference $(n - c)$ denotes the number of false 
positives to be removed from $P$, and $(m - c)$ is the number of false 
negatives, i.e., missing matches that need to be added. For simplicity, let us 
assume that deletions and additions of match pairs require the same amount of 
effort, and that the verification of a correct match pair is free. If the user 
performs the whole matching procedure manually (and does not make mistakes), 
$m$ add operations are required. Thus, the portion of the manual clean-up 
needed after applying the automatic matcher amounts to 
$\frac{(n - c) + (m - c)}{m}$ of the fully manual matching. 

We approximate the labor savings obtained by using an automatic matcher 
as accuracy of match result, defined as $1 - \frac{(n - c) + (m - c)}{m}$. 
In a perfect match, $n = m = c$, resulting in accuracy 1. Notice that 
$\frac{c}{m}$ and $\frac{c}{n}$ correspond to recall and precision of matching 
(Li and Clifton 2000). Hence, we can express match accuracy as a function of 
recall and precision as follows: 

$$\mathrm{Accuracy} = 1 - \frac{(n - c) + (m - c)}{m} = \frac{c}{m}\left(2 - \frac{n}{c}\right) = \mathrm{Recall} \cdot \left(2 - \frac{1}{\mathrm{Precision}}\right)$$ 

In the above definition, the notion of accuracy only makes sense if 
precision is not less than 0.5, i.e., at least half of the returned matches 
are correct. Otherwise, the accuracy is negative. Indeed, if more than a half 
of the matches are wrong, it would take the user more effort to remove the 
false positives and add the missing matches than to do the matching manually 
from scratch. As expected, the best accuracy 1.0 is achieved when both 
precision and recall are equal to 1.0. Notice that accuracy is biased towards 
precision. For example, recall/precision measure (0.7, 0.9) corresponds to 
accuracy 0.62. This accuracy value is higher than that for (0.9, 0.7), which 
amounts to 0.51. 



Table 9.1. Three plausible intended match results for matching problem in Fig. 7.1 
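
The accuracy metric of Sect. 9.1 is easy to compute. The sketch below gives 
both the set-based definition and the recall/precision form; the function 
names and the small proposed/intended sets are hypothetical, and the code 
assumes a non-empty intended result and a precision greater than zero. 

```python
def match_accuracy(proposed, intended):
    """Accuracy = 1 - ((n - c) + (m - c)) / m for sets of map pairs."""
    n, m = len(proposed), len(intended)
    c = len(proposed & intended)           # number of correct suggestions
    return 1 - ((n - c) + (m - c)) / m


def accuracy_from_recall_precision(recall, precision):
    """Equivalent form: Accuracy = Recall * (2 - 1 / Precision)."""
    return recall * (2 - 1 / precision)


# Hypothetical example: 4 intended pairs, 4 proposed pairs, 3 of them correct.
intended = {("a", "p"), ("b", "q"), ("c", "r"), ("d", "s")}
proposed = {("a", "p"), ("b", "q"), ("c", "r"), ("d", "x")}
print(match_accuracy(proposed, intended))                     # 0.5

# The two recall/precision measures discussed in the text.
print(round(accuracy_from_recall_precision(0.7, 0.9), 2))     # 0.62
print(round(accuracy_from_recall_precision(0.9, 0.7), 2))     # 0.51
```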



9.2 Intended Match Result 

Accuracy, as well as recall and precision, are relative measures that depend 
on the intended match result. For a meaningful assessment of match quality, 
the intended match result must be specified precisely. Recall our example de- 
aling with relational schemas that we examined in Sect. 7.1. Three plausible 
match results for this example (that we call Sparse, Expected, and Verbose) 
are presented in Table 9.1. A plus sign (+) indicates that the map pair shown 
on the right is contained in the corresponding desired match result. For exam- 
ple, map pair ([Table: Personnel], [Table: Employee]) belongs to both 
Expected and Verbose intended results. The Expected result is the one that 
we consider the most natural one. The Verbose result illustrates a scenario 
where matches are included due to additional information available to the 
human designer. For example, the data in table Personnel is obtained from 
both Employee and Department, although this is not apparent just by looking 
at the schemas. Similarly, the Sparse result is a matching where some cor- 
respondences have been eliminated due to application-dependent semantics. 
Keep in mind that in the Sparse and Verbose scenarios, the human selecting 
the “perfect” matchings has more information available than our matcher. 
Thus, clearly we cannot expect our matching algorithm to do as well as in 
the Expected case. 

Accuracy, precision, and recall obtained for all three intended results using 
version C of the flooding algorithm (see Table 7.3) are summarized in Fig. 9.1. 
For each diagram, we executed a script like the one presented in Sect. 7.1. 
The SelectThreshold operator was parameterized using $t_{rel}$-threshold 
values ranging from 0.6 to 1.0. As an additional last step in the script, we 
applied operator SQLDDLMapFilter that eliminates all matches except those 
between tables, columns, and keys. As shown in the figure, match accuracy 1.0 
is achieved for $0.95 \le t_{rel} \le 1.0$ in the Expected match, i.e., no 
manual adjustment of the result is required from the user. In contrast, if the 
intended result is Sparse, the user can save only 50% of work at best. Notice 
that the accuracy quickly becomes negative (precision goes below 0.5) with 
decreasing threshold values. Using no threshold filter at all, i.e., 
$t_{rel} = 0$, yields recall of 100% but only 4% precision, and results in a 
disastrous accuracy value of -2144% (not shown in the figure). Increasing 
threshold values corresponds to the attempt of the user to quickly prune 
undesired results by adjusting a threshold slider in a graphical tool. 

Fig. 9.1. Matching accuracy as a function of $t_{rel}$-threshold for intended 
match results Sparse, Expected, and Verbose from Table 9.1 (panels: “sparse”, 
“expected”, “verbose”) 



Fig. 9.1 indicates that the quality of matching algorithms may vary 
significantly in the presence of different matching goals. As mentioned 
earlier, our definition of accuracy is pessimistic, i.e., the user may save 
more work than indicated by the accuracy values. The reason for that is 
twofold. On the one hand, if accuracy goes far below zero, the user will 
probably scrap the proposed result altogether and start from scratch. In this 
case, no additional work (in contrast to that implied by negative accuracy) 
is required. On the other hand, removing false positives is typically less 
labor-intensive than finding the missing match candidates. For example, 
consider the data point $t_{rel} = 0.75$ in the Expected diagram. The matcher 
found all 6 intended map pairs (100% recall), and additionally returned 6 
false positives (50% precision) resulting in an accuracy of 0.0. Arguably, 
removing these false positives requires less work than starting with a blank 
screen. 
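
As a cross-check, the two data points quoted in this section can be plugged 
into the definitions of Sect. 9.1. The first line uses the set-based form 
with n = 12 returned pairs, m = 6 intended pairs, and c = 6 correct ones; the 
second uses the recall/precision form with the rounded precision of 4%. 

```latex
% Data point t_rel = 0.75 in the Expected diagram: n = 12, m = 6, c = 6
\mathrm{Accuracy} = 1 - \frac{(12 - 6) + (6 - 6)}{6} = 1 - 1 = 0

% No threshold filter: Recall = 1, Precision \approx 0.04
\mathrm{Accuracy} = \mathrm{Recall} \cdot \left(2 - \frac{1}{\mathrm{Precision}}\right)
                  \approx 1 \cdot \left(2 - \frac{1}{0.04}\right) = -23
```

With the rounded precision the formula gives about -2300%. The reported value 
of -2144% would correspond to an unrounded precision of roughly 4.27%; this is 
an inference from the formula, not a figure stated in the text. 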



9.3 User Study 

To evaluate the performance of the algorithm for schema matching tasks, 
we conducted a user study with the help of eight volunteers in the Stanford 
Database Group. The study also helped us to examine how different filters 
and parameters of the algorithm affect the match results. For our study we 
used nine relatively simple match problems. The complete specification of 
the match tasks handed out to the users is in Appendix A. Some of the 
problems were borrowed from research papers (Miller et al. 2000; Doan et al. 
2001; Rahm and Bernstein 2001). Others were derived from data used on 
websites such as Amazon.com or Yahoo.com. Every user was required to solve 
tasks of three different kinds (shown along the x-axis of Fig. 9.2): 

1. matching of XML schemas (Tasks 1,2,3) 

2. matching of XML schemas using XML data instances (Tasks 4,5,6) 

3. matching of relational schemas (Tasks 7,8,9) 

The information provided about the source and target schemas was 
intentionally vague. The users were asked to imagine a plausible scenario and 
to map elements in both schemas according to the scenario they had in mind. No 
cardinality constraints were given (any [0,n] — [0,n] mapping was accepted). 
It is noteworthy that almost no two users could agree on the intended match 
result for a given matching task, even when examples of data instances were 
provided (Tasks 4,5,6). Therefore, we could hardly expect any automatic 
procedure to produce excellent results. Of the eight users, one outlier (i.e., 
the user with highly deviating results) was eliminated. The accuracy in 
percent achieved by our algorithm (using fixpoint formula C) for each of the 
seven users and every task is summarized in Fig. 9.2. The accuracy metric was 
used to estimate the amount of work that a given user could save by using our 
algorithm. The accuracy data was obtained after applying the SelectThreshold 
operator with $t_{rel} = 1$. A negative accuracy of -14% in Task 3 indicates 
that User 1 would have spent 14% more work adjusting the automatic match 
result than doing the match manually. 



Fig. 9.2. Average matching accuracy for 7 users and 9 matching problems 

Note that in Task 1 the algorithm performed very well, while in Task 2 
the results were poor. It turned out that the models used in Task 2 had a 
very simple structure, so that the algorithm was mainly driven by the initial 
textual match. We did not use any dictionaries for string matching in any of 
the experiments reported in this chapter. Hence, the synonyms used in Task 2 
were considered plausible matches by humans but were not recognized by the 
algorithm. The matching accuracy over 7 users and 9 problems averaged 52%. 
Hence, our study suggests that for many matching tasks, as much as half of 
the manual work can be saved using very little application-specific code. 
This figure is typically even higher in simpler tasks, e.g., when matching 
two XML documents conforming to the same DTD. Using synonyms may 
further improve the results of matching. For completeness, the sizes of graphs 
obtained from schemas used in the study are summarized in Table 9.2. 



Table 9.2. Sizes of graphs in the user study

Task   Edges in            Edges/arcs in   Edges/arcs in
       propagation graph   left model      right model
T1      128                35/39           32/37
T2      313                37/43           40/46
T3      376                46/46           49/52
T4      383                55/62           39/44
T5      309                36/41           48/48
T6      571                66/55           54/45
T7      339                33/31           69/55
T8     1222                113/78          66/51
T9      594                113/78          32/30

9.4 Evaluation of Algorithm and Filters

Using matching accuracy as the quality measure, we utilized the data collec- 
ted in the user study to drive our evaluation and tuning of the algorithm 
for schema matching. As a result of this evaluation, we determined the pa- 
rameters of the algorithm and the filter that performed best on average for 
all users and matching problems in our study. The variations of the fixpoint 
formula that we used are depicted in Table 7.3 (compare Sect. 7.3). Using 
distinct fixpoint formulas results in different multimappings produced by the 
algorithm as well as different convergence speed. We then applied different 
filters to choose the best subsets of multimappings. Fig. 9.3 summarizes the 
accuracy (averaged over all tasks) obtained for every version of the algorithm 
and filter that we used. The filters were defined as follows: 




Fig. 9.3. Matching accuracy for different filters and four versions of the algorithm



— Threshold filter corresponds to the SelectThreshold operator described in 
Chap. 8. It produces mappings of cardinality [0,n] — [0,n] using the
relative-similarity threshold $t_{rel} = 1.0$.

— Exact is a [0, 1] — [0, 1] version of Threshold , which yields monogamous 
societies. 

— Best returns a [0, 1] — [0, 1] mapping using a selection metric that corre- 
sponds to the assignment problem. The implementation of the filter uses 
a greedy heuristic. For the next unmatched element, a best available can- 
didate is chosen that maximizes the cumulative similarity. 

— Left yields a [0,1] — [1,1] mapping, in which every node on the left is 
assigned a match candidate that maximizes the cumulative similarity. Right 
is a [1, 1] — [0, 1] counterpart of Left. 

— Outer filter delivers a [1 ,n] — [1 , n] mapping, in which every node on the 
left and on the right is guaranteed to have at least one match candidate. 
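
The filter definitions above can be sketched over a multimapping stored as a dictionary from map pairs to similarity values. This is a minimal illustration, not the implementation used in the study. It assumes that the relative similarity of a pair is its similarity divided by the best similarity of its left node (and likewise for its right node), and that the greedy heuristic of Best visits pairs in order of decreasing similarity. All function names and sample values are hypothetical.

```python
from collections import defaultdict


def best_per_side(sim):
    """Best similarity seen for every left node and every right node."""
    left_max, right_max = defaultdict(float), defaultdict(float)
    for (x, y), s in sim.items():
        left_max[x] = max(left_max[x], s)
        right_max[y] = max(right_max[y], s)
    return left_max, right_max


def threshold_filter(sim, t_rel=1.0):
    """Keep pairs whose relative similarity is >= t_rel from both sides."""
    left_max, right_max = best_per_side(sim)
    return {
        (x, y)
        for (x, y), s in sim.items()
        if s > 0 and s >= t_rel * left_max[x] and s >= t_rel * right_max[y]
    }


def best_filter(sim):
    """Greedy [0,1]-[0,1] selection: take the strongest still-available pair."""
    chosen, used_left, used_right = set(), set(), set()
    for (x, y), s in sorted(sim.items(), key=lambda kv: -kv[1]):
        if s > 0 and x not in used_left and y not in used_right:
            chosen.add((x, y))
            used_left.add(x)
            used_right.add(y)
    return chosen


def left_filter(sim):
    """Assign every left node its most similar right node."""
    best = {}
    for (x, y), s in sim.items():
        if x not in best or s > best[x][1]:
            best[x] = (y, s)
    return {(x, y) for x, (y, _) in best.items()}


sim = {
    ("a", "x"): 1.0, ("a", "y"): 0.6,
    ("b", "x"): 0.8, ("b", "y"): 0.5,
    ("c", "y"): 0.2,
}
print(sorted(threshold_filter(sim)))  # [('a', 'x')]
print(sorted(best_filter(sim)))       # [('a', 'x'), ('b', 'y')]
print(sorted(left_filter(sim)))       # [('a', 'x'), ('b', 'x'), ('c', 'y')]
```

In this toy input only the pair (a, x) is the mutual best choice of both its nodes, so Threshold with a relative threshold of 1.0 returns a single pair, while Left forces a candidate for every left node.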

As suggested by Fig. 9.3, the best overall accuracy of 57.9% was achieved
using the Threshold filter with the fixpoint formula B. The accuracies of the
Threshold and Exact filters lie very close to each other. This is not surprising,
since Threshold with $t_{rel} = 1.0$ typically produces [0,1] — [0,1] mappings. In
our study, Right consistently outperforms Left, since in most matching tasks
the right schemas were smaller; nodes in right schemas were therefore more 
likely to appear in the intended match results supplied by the users. Outer 
performed worst, since in many tasks only small portions of schemas were 
intended to have matching counterparts. 

We tried to estimate the usefulness of other filters, which are either hard to 
implement or require extensive computation, by using sampling. For example, 
a filter that returns a maximal matching (a [0, 1] — [0, 1] mapping with the 
most map pairs) is apparently not an optimal one for schema matching. 
Under formula B , the total number of map pairs in all tasks after applying 
the Best filter is 101, with associated accuracy of 40%. This accuracy value is 
lower than 54% obtained using the Exact filter that yields only 73 map pairs. 
Overall, our study suggests that preserving the stable-marriage property is 
desirable for selecting subsets of multimappings. 
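
To make the stable-marriage property concrete, the following sketch checks whether a [0,1]-[0,1] selection is stable, i.e., whether no unselected pair exists whose two nodes would both prefer each other to their current partners. The helper `is_stable` and the similarity values are hypothetical and serve only as an illustration.

```python
def is_stable(mapping, sim):
    """Return True if no unselected pair is preferred by both of its nodes.

    `mapping` is a set of (left, right) pairs with at most one partner per
    node; `sim` maps pairs to similarity values. Unmatched nodes are treated
    as having a current similarity of 0.
    """
    partner_of_left = {x: y for x, y in mapping}
    partner_of_right = {y: x for x, y in mapping}
    for (x, y), s in sim.items():
        if (x, y) in mapping:
            continue
        current_x = sim.get((x, partner_of_left[x]), 0.0) if x in partner_of_left else 0.0
        current_y = sim.get((partner_of_right[y], y), 0.0) if y in partner_of_right else 0.0
        if s > current_x and s > current_y:
            return False  # x and y would both rather be matched to each other
    return True


sim = {("a", "x"): 1.0, ("a", "y"): 0.81, ("b", "x"): 0.54, ("b", "y"): 0.27}

stable = {("a", "x"), ("b", "y")}         # cumulative similarity 1.27
max_cumulative = {("a", "y"), ("b", "x")}  # cumulative similarity 1.35

print(is_stable(stable, sim))          # True
print(is_stable(max_cumulative, sim))  # False: a and x prefer each other
```

The second selection has the higher cumulative similarity but separates the pair (a, x), which both nodes rank first; this is the kind of selection that a stability-preserving filter avoids.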

Notice that the fixpoint formulae A, B, and C yield comparable matching
accuracy for each filter. However, formula C has much better convergence
properties, as suggested by Table 9.3. The table shows the number n
of iterations that were required in every task to obtain a residual vector
with $\|\Delta(\sigma^n, \sigma^{n-1})\| < 0.05$. For every fixpoint formula, we executed the
algorithm in two versions, “as is” and “strongly connected”. Strongly connected
versions guarantee convergence. This effect is achieved by making $\sigma^0$ contain
positive similarity values (e.g., at least 0.001) for each map pair in the
cross-product of nodes of left and right schemas. We found experimentally that
the strongly connected versions of the algorithm yielded approximately the
same overall accuracy for the filters that preserve the stable-marriage
property (Threshold, Exact, and Best). In contrast, enforcing convergence had a
substantial negative impact on accuracy for the filters Left, Right, and Outer.
For a detailed discussion of convergence criteria please refer to Sect. 7.4.
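
The stopping rule and the “strongly connected” variant can be sketched as follows. The fixpoint formulae of Table 7.3 are not restated in this section; the sketch assumes that formula C has the form normalize($\sigma^0 + \sigma^i + \varphi(\sigma^0 + \sigma^i)$), where $\varphi$ propagates similarity along weighted edges of the propagation graph and normalization divides by the largest value. The function names, the `floor` parameter, and the toy graph are hypothetical.

```python
import math


def propagate(sigma, edges):
    """Push similarity along weighted edges (src, dst, weight)."""
    out = dict.fromkeys(sigma, 0.0)
    for src, dst, w in edges:
        out[dst] += w * sigma[src]
    return out


def normalize(sigma):
    """Scale all values so that the largest one becomes 1."""
    top = max(sigma.values())
    return {k: (v / top if top > 0 else 0.0) for k, v in sigma.items()}


def flood(sigma0, edges, eps=0.05, max_iter=1000, floor=None):
    """Iterate the assumed formula C until the residual length drops below eps.

    If `floor` is given, every map pair starts with at least that similarity,
    which mimics the "strongly connected" variant described in the text.
    Returns (similarities, iterations), with iterations=None if the loop
    hits max_iter without converging.
    """
    if floor is not None:
        sigma0 = {k: max(v, floor) for k, v in sigma0.items()}
    sigma = dict(sigma0)
    for n in range(1, max_iter + 1):
        base = {k: sigma0[k] + sigma[k] for k in sigma}
        incr = propagate(base, edges)
        nxt = normalize({k: base[k] + incr[k] for k in sigma})
        residual = math.sqrt(sum((nxt[k] - sigma[k]) ** 2 for k in sigma))
        sigma = nxt
        if residual < eps:
            return sigma, n
    return sigma, None


# Toy propagation graph with two map pairs that reinforce each other.
edges = [("p1", "p2", 1.0), ("p2", "p1", 1.0)]
sigma, iterations = flood({"p1": 1.0, "p2": 0.0}, edges)
print(sigma, iterations)

# "Strongly connected" run: every pair starts with at least 0.001.
sigma_sc, iterations_sc = flood({"p1": 1.0, "p2": 0.0}, edges, floor=0.001)
print(sigma_sc, iterations_sc)
```

Returning `None` for the iteration count when `max_iter` is exhausted corresponds to the ∞ entries in Table 9.3, where a run did not converge.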



Table 9.3. Illustration of convergence properties of variations of the fixpoint
formula for tasks T1, ..., T9 in the user study. Shows iterations needed until
the length of the residual vector got below 0.05.

Fixpoint formula       T1   T2   T3   T4    T5   T6   T7   T8   T9   Total
A (as is)              18   48  122   78     ∞   12   37   25   25       ∞
A (strongly conn’d)    15   56   89   81  1488   18   48   25   31    1851
B (as is)               8  428   17   39     8   13   10   24   21     568
B (strongly conn’d)     7  268   21   32    13   15   14   21   53     444
C (as is)               7    9    9   11     7    7    9   10    9      78
C (strongly conn’d)     7    9    8   11     7    5    9    7    9      72




The formula for computing the propagation coefficients in the induced
propagation graph is another important configuration parameter of the flooding
algorithm. We experimented with seven distinct formulae and determined
the one that performed best in our user study. For the details of this
experiment please refer to Sect. 7.3. The best-performing formula is based on
the inverse average of equilabeled edges in the graphs to be matched. This 
approach is similar to the one illustrated in Sect. 7.2, which corresponds to 
inverse product, and performs only slightly better. 

As a last experiment in this section, we study the impact of the initial
similarity values ($\sigma^0$) on the performance of the algorithm. For this purpose,
we randomly distorted the initial values computed by the string matcher. The
initial similarities were computed using two versions of a string matcher, one
of which took term frequencies into account. Fig. 9.4 depicts the influence of
randomization on matching accuracy across all users and matching tasks. For
example, randomization of 50% means that every initial similarity value was
randomly increased or decreased by x percent, $x \in [-50\%, 50\%]$. Negative
similarity was adjusted to zero. It is noteworthy that a randomization factor
of 100% introduced an accuracy penalty of just about 15%. This result indicates
that the similarity flooding algorithm is relatively robust against variations
in seed similarities. The dotted lines show another radical modification of
initial similarities, in which each non-zero value in $\sigma^0$ was set to the same
number computed as the average of all positive similarity values. In this case,
the accuracy dropped to 30%, which still saves the users on average one third
of the manual work.
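
The two distortions of the seed similarities described above can be sketched as follows. The sketch assumes that x is drawn uniformly from the stated interval, which the text does not specify; the function names and sample values are hypothetical.

```python
import random


def distort(sigma0, percent, rng):
    """Scale every initial similarity by (1 + x/100), x in [-percent, percent].

    Values that would become negative are clamped to zero. The uniform
    distribution of x is an assumption.
    """
    out = {}
    for pair, s in sigma0.items():
        x = rng.uniform(-percent, percent)
        out[pair] = max(0.0, s * (1.0 + x / 100.0))
    return out


def flatten_to_average(sigma0):
    """Replace every non-zero value by the average of all positive values."""
    positive = [s for s in sigma0.values() if s > 0]
    if not positive:
        return dict(sigma0)
    avg = sum(positive) / len(positive)
    return {pair: (avg if s > 0 else 0.0) for pair, s in sigma0.items()}


sigma0 = {("a", "x"): 0.25, ("b", "y"): 0.75, ("c", "z"): 0.0}

rng = random.Random(0)  # fixed seed so that a run can be repeated
print(distort(sigma0, 50, rng))
print(flatten_to_average(sigma0))
```

The second function removes all ranking information from the seed values and keeps only the distinction between zero and non-zero pairs, which is the “radical modification” discussed in the text.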




Fig. 9.4. Impact of randomizing initial similarities on matching accuracy (curves: w/ freq, w/o freq, avg w/ freq, avg w/o freq)



9.5 Propagation Coefficients 

The similarity flooding algorithm offers several tuning parameters. One such
parameter is the definition of the function $\pi$ that computes the propagation
coefficients in the propagation graph. Above we presented a product-based
definition of $\pi$ that we used to illustrate the algorithm in Sect. 7.2. In our
user study we found empirically that an average-based definition of $\pi$ slightly
outperformed the product-based one. The average-based $\pi$-formula as well as






another six approaches to computing the propagation coefficients that we 
examined are summarized in Table 9.4. 

For example, the stochastic formula ensures that the sum of propagation
coefficients on all edges originating from each map pair in the propagation
graph is 1.0. Hence, the transition matrix that corresponds to the propagation
graph (see Sect. 7.4) becomes a stochastic matrix, i.e., the entries in each
column sum to 1. We evaluated the performance of each $\pi$-function listed
in the table using the data obtained in the user study. Fig. 9.5 summarizes
the accuracy values obtained using different $\pi$-functions. In this experiment,
we used the fixpoint formula B of Table 7.3 and filters Threshold and Best
to determine the overall average accuracy. We found that the constant
$\pi$-function, which places weights of 1.0 on each edge of the propagation graph,
performs surprisingly well as compared to more sophisticated approaches.
We did not extensively examine other $\pi$-functions that take into account
edge-label similarities, i.e., those that return a non-zero value when $p \neq q$.
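
The constant and stochastic variants described above can be illustrated on a small propagation graph given as a list of directed edges between map pairs. The sketch assumes that the stochastic variant splits the weight uniformly over the outgoing edges of a map pair; the text only requires that the outgoing weights sum to 1.0. The function names and the toy graph are hypothetical, and the edges are assumed to be unique.

```python
from collections import Counter, defaultdict


def constant_pi(edges):
    """Place a weight of 1.0 on every edge of the propagation graph."""
    return {edge: 1.0 for edge in edges}


def stochastic_pi(edges):
    """Weight the outgoing edges of each map pair so that they sum to 1.0.

    The uniform split 1/out_degree is an assumption; any non-negative
    weights with the same sum would satisfy the property in the text.
    """
    out_degree = Counter(src for src, _ in edges)
    return {(src, dst): 1.0 / out_degree[src] for src, dst in edges}


def outgoing_sums(weights):
    """Sum of coefficients on the edges leaving each map pair."""
    sums = defaultdict(float)
    for (src, _), w in weights.items():
        sums[src] += w
    return dict(sums)


# Edges of a toy propagation graph: (source map pair, target map pair).
edges = [
    (("a", "x"), ("b", "y")),
    (("a", "x"), ("c", "z")),
    (("b", "y"), ("a", "x")),
]

print(outgoing_sums(constant_pi(edges)))    # (a, x) sums to 2.0
print(outgoing_sums(stochastic_pi(edges)))  # every source sums to 1.0
```

With the constant variant a map pair with many neighbours contributes more total similarity per iteration, whereas the stochastic variant gives every map pair the same total outgoing weight.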




Fig. 9.5. Impact of different ways of computing propagation coefficients on overall matching accuracy in the user study (bars: inv avg, inv prod, inv total avg, inv total prod, combined, stochastic, constant)



9.6 Conclusions and Open Issues 

In Part III, we presented a simple structural algorithm based on fixpoint com- 
putation that is usable for matching of diverse data structures. We illustrated 
the applicability of the algorithm to a variety of scenarios. We defined several 
filtering strategies for pruning the immediate result of the fixpoint computa- 
tion. We suggested a novel quality metric for evaluating the performance of 
matching algorithms, and conducted a user study to determine which configu- 
ration of the algorithm and filters performs best in chosen schema matching 
scenarios. We discussed the convergence and complexity of the algorithm. 
The main results of our study were the following: 

— For an average user, overall labor savings across all tasks were above 50%. 
Recall from Chap. 9 that our accuracy metric gives a pessimistic estimate, 
i.e., actual savings may be even higher. 

— A quickly converging version of the fixpoint formula (C) did not introduce
accuracy penalties.

— Threshold filter performed best.

— The best formula for computing the propagation coefficients was the one
based on inverse average (Sect. 7.3).

— The flooding algorithm is relatively insensitive to “errors” in initial
similarity values.

Table 9.4. Different approaches to computing the propagation coefficients

By studying various model-management scenarios, we found that the SF 
algorithm performs particularly well if the input schemas are two versions of 
the same schema, with minor variations. In such cases, the algorithm often 
produces an output that does not need any manual post-processing. 

Below we summarize the limitations of the algorithm and several open 
issues that need to be investigated. This list is by no means exhaustive: 

1. The algorithm works for directed labeled graphs only. It degrades when 
labeling is uniform or undirected, or when nodes are less distinguishable. 
For example, the algorithm does not perform well for solving the graph 
isomorphism problem on undirected graphs having no edge labels. 

2. Applicability of the algorithm is limited to equityped models. While mat- 
ching of an XML schema against another XML schema delivers usable 
results, matching of a relational schema against an XML schema fails. 

3. An important assumption behind the algorithm is that adjacency con- 
tributes to similarity propagation. Thus, the algorithm will perform un- 
expectedly in cases when adjacency information is not preserved. For 
example, in HTML pages nodes that are structurally far away from each 
other may be displayed visually close. Thus, two cells in an HTML table 
that are vertically adjacent may be far apart in the document and won’t 
contribute to similarity propagation. 

4. The algorithm tends to favor superstructures. Consider graph A containing
subgraph $A_1$. Let graph B contain a superstructure $B_1$ such that
$A \subset B_1$ and a substructure $B_2$ such that $B_2 \subset A$. The algorithm would
favor $B_1$ as a match candidate for A, i.e., similarity values between nodes
in A and $B_1$ will be higher than those between A and $B_2$.

5. Currently, we do not consider order and aggregation in the algorithm. 
It is possible that matching of XML schemas could benefit from taking 
XML features into account. 

6. The distribution of the similarity values produced by the fixpoint compu- 
tation is non-uniform. It may be difficult to combine the algorithm with 
the matching techniques that rely on absolute similarity values such as 
those presented in (Do and Rahm 2002). 

7. It is unlikely that a standalone version of the algorithm could outperform
custom matchers developed for a particular domain. Custom matchers
may deploy domain-specific heuristics that are not available to the
similarity flooding algorithm (e.g., value ranges, cardinalities, classifiers,
etc.). However, the algorithm can make use of custom import and export
filters, such as XMLMapFilter mentioned in Sect. 7.5, to prototype a
first-cut version of a specialized matcher quickly.

A detailed examination of the related work on matching algorithms and
evaluation of matching techniques is presented in Sect. 10.2.



Model Management in Perspective 




10. Related Work 



“The secret to creativity is knowing how to hide your sources.” 
- Albert Einstein (1879-1955) 



The work on metadata management looks back on over three decades of
prolific research efforts ranging from the invention of database schemas (Mc- 
Gee 1959) to database design (Wiederhold 1977), from storing schemas and 
queries as first-class objects (Stonebraker et al. 1984; den Bussche et al. 1993; 
Lakshmanan et al. 2001), to transforming them using complex algorithms 
(Abiteboul et al. 1995; Halevy 2001). Generic model management is, howe- 
ver, a quite recent approach to metadata management. Its goal is to factor 
out the similarities of the metadata problems studied in the literature and 
develop a set of high-level operators that can be utilized in various scena- 
rios. The set of operators that we examined was inspired by the vision and 
model management scenarios presented in (Bernstein et al. 2000b; Bernstein 
and Rahm 2000; Bernstein 2003) and can be traced back to the early work 
(Wiederhold 1994). 

In this chapter, we survey the literature that motivated the operator de- 
finitions, algorithms, and scenarios presented in the dissertation: 

— In Sections 10.1-10.5 we examine the major metadata problems that un- 
derpin the operators Merge, Match, Compose, Extract, and Diff. These pro- 
blems are data integration, schema matching, mapping composition, view 
selection, and view complement, respectively. 

— In Sect. 10.6, we discuss the work that exploited state-based semantics and 
show how our approach can be viewed in category-theoretic terms. 

— In Sect. 10.7, we discuss the metadata management capabilities of today’s 
repository systems. 

— In Sect. 10.8, we review two metadata-intensive applications, declarative 
mediation and change propagation. 

— In Sect. 10.9, we briefly cover other related work such as data translation, 
mapping tables, and the Z method. 



S. Melnik: Generic Model Management, LNCS 2967, pp. 163-197, 2004. 
© Springer-Verlag Berlin Heidelberg 2004



10.1 Data Integration and Merge 

Data integration is probably the most widely known metadata-centric area 
of database research. Data integration comprises a whole range of problems 
that arise when heterogeneous data needs to be stored or manipulated in a 
uniform fashion. A characteristic property of data integration scenarios is the 
presence of a number of heterogeneous schemas, called local schemas, and a 
unified, integrated schema, called global schema (Lenzerini 2002). The data 
itself can be stored either in local databases, in a global database, or at both 
places. Either the local schemas or the global schema can be used to query 
or update the data. Depending on where the data actually resides, query 
and update rewriting may be necessary. When the data is stored in the local 
databases, the latter are often called (data) sources. 

At least three major data integration scenarios can be distinguished in 
the literature depending on the integration goals, location of data, target of 
queries, etc. (Wiederhold 1977; Batini et al. 1986; Davidson et al. 1995a): 

— In view integration , a global schema is produced for a number of local 
schemas. The local schemas capture the individual requirements of diffe- 
rent user groups. The global schema must be capable of representing the 
complete information of each local schema to satisfy the requirements of 
each user group. In other words, each local schema must be definable as a 
view on the global schema. The users do not adopt the global schema for 
their applications but run queries and updates through the local schemas 
defined as views over the global schema, i.e., query rewriting is required. 
Moreover, in some cases the local databases already contain data before 
the integration takes place. This data needs to be physically migrated to 
the global database. 

— The aim of database integration is to provide a uniform view on a set of 
local databases, or sources. That is, the data remains at the sources but is 
queried and updated via the global schema. The global schema may cover 
all information of the local schemas or, more typically, only a fragment 
relevant for a particular application. 

— In data warehousing , data is stored both at the sources, which are called 
operational databases, and in the global database, or data warehouse. The 
content of the data warehouse typically contains a portion of the opera- 
tional data but may also store historical information that is not present 
in the sources any more. Online analytical processing (OLAP) queries are 
run against the global database and need not be rewritten. The updates on 
sources are either propagated to the warehouse in batches or are rewritten 
and executed directly on the warehouse if the warehouse is incrementally 
updatable. 

The properties of the aforementioned scenarios are summarized in Table 10.1.
Letters L and G stand for “local” and “global”, respectively. The
terms LAV, GAV, and GLAV abbreviate local-as-view, global-as-view, and
global-local-as-view, respectively. They are discussed in Sect. 10.1.2.

Table 10.1. Data integration scenarios

Scenario               Location   Target of queries   Direction of   Coverage of
                       of data    or updates          mappings       global schema
View integration       (L →) G    L                   LAV            complete
Database integration   L          G and L             GLAV           partial
Data warehousing       L and G    G and L             GAV            partial

Across all these scenarios, several common tasks can be identified: 

1. Constructing a global schema for a given set of local schemas, known as 
schema integration. 

2. Constructing mappings between the given global schema and local sche- 
mas or among the local schemas, studied in the context of schema mat- 
ching. 

3. Answering queries and performing updates on the local databases via 
the global view or on the global database via the local views. This task 
has been studied in the context of answering queries using views and the 
view update problem. 

In this section, we focus on schema integration and answering queries using 
views. We review the schema matching problem in Sect. 10.2. In Sect. 10.8.1, 
we consider another aspect closely related to data integration, namely inte- 
gration of information processing services. 

10.1.1 Schema Integration 

The operator Merge is designed to be used as a principal schema integration 
operator. We derived the signature and the definition of the semantics of 
the operator from various approaches to schema integration suggested in the 
literature. 

Inputs and Outputs of Schema Integration. One of the key observations that 
surfaced in most approaches is that schema integration is driven by a formal
description of the relationship between the local schemas (input schemas). 
This relationship is called “interrelational dependencies” in (Casanova and 
Vidal 1983), “integration constraints” in (Biskup and Convent 1986), and 
“inter-schema correspondences” in (Spaccapietra and Parent 1994). Davidson 
et al. (1995a) argued that the relationship between the input schemas needs 
to be specified using a complex mapping language. Thus, our operator Merge 
takes as input the schemas and a mapping between the schemas. The output 
of the operator includes the mappings between the integrated schema and 
the input schemas, just as in (Spaccapietra and Parent 1994). 

Some approaches to schema integration, such as (Buneman et al. 1992),
make a simplifying assumption that the input mapping is given implicitly
through syntactic equality of the elements (classes, attributes, etc.) that occur
in the input models. Yet other techniques, e.g., (Gotthard et al. 1992; Noy
and Musen 2000), allow the engineer to construct the input mapping during
the merging process, i.e., include a built-in matching step. Taking a mapping
as an input parameter to Merge offers a more general approach, which facilitates
the development of independent algorithms for Match and Merge and
helps formalize implicit assumptions. Still, as we point out below, the two
operations may sometimes be hard to separate from each other.

Complete vs. Partial Global Schema. Batini et al. (1986) observe that the 
process of constructing the global schema is often performed in two phases, 
by first obtaining a union of the source schemas and then performing re- 
structuring operations on the result. The idea behind “unioning” the source 
schemas is to ensure that the global schema preserves all information of the 
source schemas. As we explained above, this is precisely the objective in the 
view integration scenarios. We followed this intuition in defining the seman- 
tics of the operator Merge. 

In contrast, in database integration the goal is to provide access to several 
existing, autonomous databases. In this setting, the global schema may pro- 
vide any conceivable (partial) view on the local schemas, i.e., its construction 
is driven by the application requirements and cannot be fully automated. 
The trend to focus on importing and integrating selected portions of source 
databases, as in the mediation and federated architectures, has been obser- 
ved in (Hull 1997). Notice that the selected portions of local schemas can 
be characterized by defining a view Vi on each local schema Si such that Vi 
exposes all information of Si relevant for the given application. Then, the 
extracted view schemas can be taken as inputs of the Merge operator. Hence, 
it seems that the objective of capturing all information is fundamental to 
schema integration and can be exploited even in mediation and federated 
architectures. In several other schema integration tasks the objective is to 
construct a global schema that exposes only the “overlapping” information 
of the source databases. This case corresponds to the operator Intersect that 
we define in Sect. 11.3.3. The distinction between integrating all vs. over- 
lapping information was emphasized and studied in (Buneman et al. 1992; 
Wiederhold 1994). 

Minimality of Global Schema. The purpose of restructuring of the global 
schema that is done in various methodologies is to eliminate the redundancy 
caused by the fact that the local schemas are not disjoint and to obtain 
a “minimal” global schema that still captures the complete content of the 
local schemas. We formalized this idea in condition (iv) of Definition 4.2.4. 
A variety of transformation heuristics, rules, or restructuring primitives have 
been suggested to obtain “smaller” or “better” schemas (Casanova and Vidal 
1983; Buneman et al. 1992). A common requirement on such transformation
rules is information preservation. For example, in (Spaccapietra and Parent
1994) the rule definitions are based on the principle that whenever there is a 
conflict between two structures of the schemas, the integrated schema holds 
the more unconstrained structure. 

Casanova and Vidal (1983) describe a heuristic optimization procedure 
that tries to reduce redundancy and the size of the output schema produced 
by merging. They consider a schema language with a rich set of constraints 
such as inclusion, exclusion and functional dependencies and observe that 
constructing minimization procedures for such a language is very hard due 
to the interaction of the dependencies and the complexity of the inference 
problem for inclusion dependencies. 

Uniqueness of Merge Result. Ideally, the outcome of the Merge operation 
should not depend on the particular merge algorithms, on the choice of 
schema restructuring primitives, or on the order in which the local schemas 
are merged. In other words, the input schemas and mapping should uniquely 
determine the merged schema and the mappings from the merged schema 
to the input schemas. Indeed, the definition of Merge given in Sect. 4.2.4 
suggests that Merge could be a fully automatic operation, provided that all 
potential conflicts are resolved in the input mapping. And yet, as observed 
in (Rosenthal and Reiner 1994), much of the research literature on schema 
integration consists of careful case-by-case heuristics for resolving particu- 
lar types of mismatches between schemas, such as conditionally mergeable 
relationships. That is, the result of Merge may vary substantially from one 
methodology to another. 

The reason for this discrepancy becomes clearer if we examine the work 
such as (Casanova and Vidal 1983; Biskup and Convent 1986; Buneman et al. 
1992; Rosenthal and Reiner 1994), which give a rigorous theoretical treatment 
of schema integration. Assuming that the mapping between the input sche- 
mas has been agreed upon, a primary source of difficulties seems to be due 
to the limitations of the schema language used for representing the merged 
schema. 

For example, Biskup and Convent (1986) suggested a notation quite similar
to our Merge operator to denote the immediate result of schema integration:
$Comb(v_1, \ldots, v_n, I)$, where $v_i$ are view schemas and $I$ is a set of
integration constraints (input mapping). Essentially, Comb() describes a trivial
way of merging the schemas by simply including the integration constraints
specified in the input mapping into the definition of the merged schema, just as
suggested in Theorem 4.2.4. The authors observed that Comb() is in general
not a valid database schema because it may contain constraints that cannot
be represented in the schema language. The goal of the subsequent manipulation
of Comb() is to find a valid schema G that contains no integration
constraints and is equally expressive as Comb(). They present a pseudocode
algorithm for eliminating integration constraints step by step. Since it may
not be possible to rewrite all integration constraints into integrity constraints
in G, the authors acknowledge that G may have to be more expressive than
Comb(). Obviously, there is substantial flexibility in choosing G, which in
part explains various heuristics and restructuring primitives that have been
examined in the literature.

As another example, consider commutativity and associativity of Merge, 
highly desirable properties (see e.g., (ElMasri 1980)) that are hardly ever 
satisfied in existing schema integration techniques. Associativity of Merge 
was discussed in (Buneman et al. 1992) for a simple object-oriented schema 
language with generalization. The authors argued that the merged schemas 
may contain so-called implicit classes in addition to the classes of the sche- 
mas being merged. If these implicit classes are made explicit in the result 
(similar to replacing an existential formula by a constant), Merge becomes 
non-associative. To overcome this problem, the authors extended their schema 
language to accommodate the implicit classes. In this example, the limited 
expressiveness of the schema language was a major obstacle in making merge 
associative. 

Across many techniques, the variability of results of schema integration 
is often due to the need to materialize the merged schema in some target 
schema language. The evaluation of the merge algorithms and heuristics 
proposed in the literature is an open problem, which is largely due to the 
lack of established quality metrics and formal requirements. 

Schema and Mapping Languages. A variety of schema and mapping langu- 
ages have been used for schema integration. Relational schemas are consi- 
dered in (Casanova and Vidal 1983; Biskup and Convent 1986). Buneman 
et al. (1992) use a simple object-oriented schema language. The work in 
(Miller et al. 1994; Spaccapietra and Parent 1994; Pottinger and Bernstein 
2003) uses other quite different schema languages that are reminiscent of the
Entity-Relationship (ER) schema language. Noy and Musen (2000) focus on 
merging of ontologies. 

The spectrum of utilized mapping languages is also quite broad. In (Ca- 
sanova and Vidal 1983; Biskup and Convent 1986; Spaccapietra and Parent 
1994), the mapping language consists of a set of element correspondence as- 
sertions on entities or relational attributes. The assertions include equality 
of value set, containment (inclusion dependencies), non-empty intersection, 
disjointness, or functional dependencies. The morphisms, our mapping lan- 
guage discussed in Sections 2.2.2 and 6.1, can be seen as a simple variant of 
the languages suggested in these approaches, in which only equality of value 
sets is utilized. 

The mapping language of Pottinger and Bernstein (2003) allows descri- 
bing the structural relationships between the elements of the input schemas 
and can be utilized for specifying a priori how the structural conflicts are 
to be resolved. Davidson et al. (1995a) suggested a very powerful mapping 
language WOL (Well-founded Object Language). They motivate the need for
such a language using schema integration scenarios but do not address the
computation of the merge result.

Separability of Match and Merge. The order of the unioning and restructu- 
ring phases used in schema integration approaches varies between integration 
methods. For example, in (Motro 1987) the union is carried out first to ob- 
tain a “superview” of the sources, which is then restructured into a final 
shape. On the other hand, in (Buneman et al. 1992; Spaccapietra and Pa- 
rent 1994) the information on how the schemas are to be reshaped is part of 
the input, i.e., in a way restructuring happens before unioning of schemas. 
In (Gotthard et al. 1992; Noy and Musen 2000), the user is asked questions 
such as whether two given structures are similar and whether they conflict 
during the merging process. In (Pottinger and Bernstein 2003) and in our
GraphMerge algorithm of Sect. 3.2.7, restructuring information comes in part
from the input mapping and in part from the information obtained during 
semiautomatic conflict resolution. 

The input mapping used for schema integration is obtained by matching 
the input schemas using the operator Match. The benefit of separating mat- 
ching and merging is that these two operators can be studied independently, 
and the algorithms developed for matching or merging can be used across 
different approaches to schema integration. However, this separation is not 
always possible. As we argue in Sect. 11.3.4, in some cases two schemas can 
be only related by way of a third schema and two mappings. That is, the 
problem of constructing the global schema may be very closely intertwined 
with that of matching the local schemas. Moreover, binary mappings may 
be insufficient to describe the relationships between more than two input 
schemas. Furthermore, in practice it may be easier to merge schemas step by 
step, resolving conflicts on demand, as compared to specifying the complete 
conflict resolution information a priori in a matching step. It may however 
be possible to specify such a stepwise schema integration approach using a 
model-management script that uses the elementary operators Extract, Match 
and Merge. 

Merging Algorithms. Much of the work on schema integration focused on the 
design of algorithms that satisfy certain desirable properties, such as infor- 
mation preservation (see e.g. (Spaccapietra and Parent 1994)). In contrast, 
the approach exploited in Rondo and in (Pottinger and Bernstein 2003) aims 
at simplifying the implementation of the Merge operator for various kinds of 
models. In the GraphMerge algorithm of Sect. 3.2.7 and the algorithm pre- 
sented in (Pottinger and Bernstein 2003), the input schemas are represented 
and manipulated as graphs. The attractiveness of this approach is that the al- 
gorithms can be easily adapted for new kinds of models, either by tuning the 
conflict resolution rules (function conflictsWith in Sect. 3.2.7) or by encoding 
the conflict-resolution strategy in the input mapping (e.g., using direction of 
morphism arcs in Rondo or structure of input mappings in (Pottinger and 
Bernstein 2003)).

The generic merging algorithm proposed by Pottinger and Bernstein
(2003) is similar in spirit to the GraphMerge algorithm. Instead of simple 
morphisms, their algorithm takes as input a mapping with a complex in- 
ternal structure, which can be exploited for describing how the structural 
conflicts of the input models are to be resolved. The authors demonstrate 
that the algorithms developed in (Spaccapietra and Parent 1994; Noy and 
Musen 2000) and the GraphMerge algorithm can be implemented by adap- 
ting their generic merging algorithm. They observe that an approach based on 
a common meta-meta model is very flexible and can accommodate virtually 
all merging techniques proposed in the literature. 

Indeed, as long as the schemas and merging rules are relatively simple it 
is easy to see how a structured mapping and simple conflict resolution rules 
can drive the transformation of the input schemas represented as graphs into 
an output schema. However, in presence of non-trivial schema constraints or 
an expressive mapping language the conflict resolution rules may become ar- 
bitrarily complex and the benefit of a generic merging algorithm diminishes. 
For example, it is unlikely that the algorithm and the axiom system presen- 
ted in (Casanova and Vidal 1983) can be expressed using simple structural 
primitives. 

10.1.2 Answering Queries Using Views 

When the global schema, the local schemas, and the mappings between them 
are given, the major remaining task is to rewrite the queries stated in terms 
of the global schema into queries on local schemas, or the other way around. 
The problem of answering queries using views can be stated in a relatively 
straightforward fashion in terms of state-based semantics. However, com- 
puting the rewritten queries can be extremely hard for concrete mapping 
languages. The purpose of this section is to illustrate that materializing the 
mappings that are produced in model-management scripts can be very chal- 
lenging, yet there has been substantial research that we can build upon. A 
recent survey of approaches to answering queries using views is presented in 
(Halevy 2001). 

Equivalent and Maximally-Contained Rewritings. In its simplest setting, the 
problem of answering queries using views can be formulated as follows. Given 
a view $m\_m_x$ on $m$, rewrite query $q$ on $m$ into a query $q'$ on $m_x$ such that
the result of $q'$ is identical to the result of $q$ or, if this is not possible, the
result of $q'$ is a maximal result that is contained in the result of $q$. Thus, the
two major problems studied in the context of answering queries using views 
are the equivalent rewriting problem and the maximally-contained rewriting 
problem. Equivalent rewriting is a prerequisite of query optimization (see 
e.g. (Goldstein and Larson 2001)), whereas in a data integration setting, 
equivalent rewriting is rarely achievable and one frequently has to settle for 
maximally-contained rewritings.

An equivalent rewriting exists only when the condition
$q = m\_m_x \circ \mathit{Invert}(m\_m_x) \circ q$ holds. In this case, the
rewritten query $q'$ on $m_x$ is $q' = \mathit{Invert}(m\_m_x) \circ q$.
However, in general $q'$ may be non-functional and only a weaker condition
$q \subseteq m\_m_x \circ \mathit{Invert}(m\_m_x) \circ q$ is satisfied
(assuming that $m\_m_x$ is total). That is, for a fixed $x \in m_x$ there are
multiple possible results $r_1, \ldots, r_k$ of $q'$ such that
$(x, r_i) \in q'$. Each possible result $r_i$ corresponds to executing $q$ on
some source state $y_i \in m$ with $(y_i, x) \in m\_m_x$. The intersection
$r = r_1 \cap \ldots \cap r_k$ of all possible results of $q'$ yields the
certain result for $x$. The certain result is the maximal result that is
contained in each result $q[y_i]$. The query $q_c$ that returns the certain
result for each $x \in m_x$ is a maximally-contained rewriting of $q$.
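
The construction of the certain result can be illustrated on a small finite case. The following is only a sketch under strong assumptions: the set of source states is finite and listed explicitly, instances are modeled as Python frozensets of tuples, and the names `certain_result`, `view`, and `q` are hypothetical and chosen for illustration. It is not a rewriting algorithm; it merely evaluates the intersection of the possible results for one fixed view state.

```python
from functools import reduce

def certain_result(q, view, source_states, x):
    """Intersect q[y] over all listed source states y with view[y] == x."""
    possible = [q(y) for y in source_states if view(y) == x]
    if not possible:
        return None  # no listed source state is mapped to x
    return reduce(frozenset.intersection, possible)

# Hypothetical toy data: a source state is a set of (a, b) tuples.
def view(y):
    # Plays the role of the view m_m_x: projection on attribute a.
    return frozenset(a for (a, _b) in y)

def q(y):
    # Query on the source schema: tuples whose b-value is "x" or "y".
    return frozenset(t for t in y if t[1] in ("x", "y"))

y1 = frozenset({(1, "x"), (2, "y")})
y2 = frozenset({(1, "x"), (2, "z")})

# Both source states yield the same view state {1, 2}, so both are candidates.
print(certain_result(q, view, [y1, y2], frozenset({1, 2})))
# prints frozenset({(1, 'x')})
```

In this toy case the certain result happens to coincide with one of the possible results, q[y2]; in general it need not.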

The concept of a certain answer was originally introduced in (Abiteboul 
and Duschka 1998) for a relational setting. A certain answer is a tuple that 
occurs in each possible result $r_i$. What we call a certain result above corre-
sponds to the set of all certain answers in the relational case. A certain result
$r_1 \cap \ldots \cap r_k$ may not be equal to one of the possible results $r_i$, i.e., in general
$q_c \not\subseteq \mathit{Invert}(m\_m_x) \circ q = q'$.

Notice that the notion of a certain result depends on a concrete schema 
language, whereas the notion of a maximally-contained rewriting depends, in 
addition, on a concrete mapping language. For example, Halevy (2001) ob- 
serves that a rewriting is maximally-contained only with respect to a specific 
query language; there can sometimes be a maximally-contained query in a 
more expressive language that provides more answers. Thus, for certain query 
languages a maximally-contained rewriting may not yield all certain answers. 
Similarly, the notion of a certain result depends on the instance-containment 
relationship which varies among schema languages, as we stress below when 
we discuss query containment. Thus, just as with other model-management 
scenarios that we considered, the hardness of the problem of answering queries 
using views is due to materialization of rewritten queries in concrete mapping 
languages. For many important schema and mapping languages, the problem 
is NP-hard or even undecidable (see complexity summary in (Abiteboul and 
Duschka 1998; Lenzerini 2002)). 

In practice, many factors need to be considered for choosing an “opti- 
mal” rewriting (if it is computable). For example, instead of a certain result 
we may be more interested in computing a maximal partial result that can 
be obtained from a set of distributed sources under given time constraints. 
Or, if the goal is equivalent rewriting, we may want to find the “cheapest” 
equivalent rewriting, especially in the context of query optimization. 

Query Containment. The problem of finding a maximally-contained query
rewriting $q_c$ is closely related to the query containment problem (Ullman
1997). A maximally-contained query $q_c$ is defined as a query on $m_x$ that is
“contained” in $q$, denoted as $q_c \subseteq_c q$, and is maximal with respect to $\subseteq_c$,
i.e., $q_c \subseteq_c t \subseteq_c q$ implies $t = q_c$ for each query $t$. The algorithms for query
containment do not provide a means for computing $q_c$ but help verify whether
a given $q_c$ is a candidate rewriting.

Strictly speaking, the query containment problem deals with the contain-
ment of query results, rather than queries. Formally, $q_1 \subseteq_c q_2$ iff
$\forall x \in m : q_1[x] \subseteq_i q_2[x]$. The result-containment relationship
$\subseteq_c$ is defined in terms of the instance-containment relationship
$\subseteq_i$. The relationship $\subseteq_i$ is a transitively closed reflexive
relation on $m \times m$ and can be defined in various ways for concrete schema
languages. For instances of relational schemas, $x_1 \subseteq_i x_2$
typically indicates that the set of tuples of $x_1$ is contained in the set of tuples
of $x_2$. However, $\subseteq_i$ has to be defined differently for relations with multiset se-
mantics, instances of object-oriented database schemas, or XML documents.
In these cases, $\subseteq_i$ may be based on a sublist, subtree, or graph embedding
relationship on instances.
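
To make the dependence on the instance-containment relationship concrete, the following sketch checks result containment over an explicitly listed, finite set of instances, once under set semantics and once under multiset semantics. All names (`contained_set`, `contained_bag`, `query_contained`) are hypothetical, and testing a finite list of instances can only refute containment, not establish it for all instances of a schema.

```python
from collections import Counter

def contained_set(x1, x2):
    """Instance containment under set semantics: tuples of x1 occur in x2."""
    return set(x1) <= set(x2)

def contained_bag(x1, x2):
    """Instance containment under multiset semantics: multiplicities count."""
    c1, c2 = Counter(x1), Counter(x2)
    return all(c2[t] >= n for t, n in c1.items())

def query_contained(q1, q2, instances, contained=contained_set):
    """Check q1[x] contained in q2[x] for every listed instance x."""
    return all(contained(q1(x), q2(x)) for x in instances)

# Hypothetical queries over a list of (a, b) tuples.
def q1(x):
    return [(a,) for (a, _b) in x]        # projection that keeps duplicates

def q2(x):
    return list({(a,) for (a, _b) in x})  # projection that removes duplicates

instance = [(1, "p"), (1, "q")]
print(query_contained(q1, q2, [instance], contained_set))  # prints True
print(query_contained(q1, q2, [instance], contained_bag))  # prints False
```

The same pair of queries is related by containment under one definition of instance containment and not under the other, which is the point made in the text.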

Several approaches in the literature use alternative, more general notions 
of query containment. For example, Li et al. (2001) suggest the notion of
“p-containment” meaning that the result of a query $q_1$ p-contained in $q_2$ can
be computed from the result of $q_2$ using a third query $f$, i.e., $q_1 = q_2 \circ f$.
In the simplest case, which is often assumed for relational schemas, $f$ is a
selection query. A similar idea is presented in (Bancilhon and Spyratos 1981).
There, the authors use the condition
$\forall x, y \in m : q_2[x] = q_2[y] \Rightarrow q_1[x] = q_1[y]$.
Whenever it holds, they say that $q_2$ “determines” $q_1$. Their definition
is subsumed by the condition $q_1 = q_2 \circ f$. The advantage of using these
more general notions of query containment is that they can be characterized 
in state-based semantics without appealing to a containment relationship 
on instances that needs to be defined separately for each concrete schema 
language. 
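
The “determines” condition can be checked directly on a finite set of states, and when it holds the check yields a witness for the third query f as a lookup table. The sketch below assumes that states and query results are hashable Python values and that the states are listed explicitly; the function name `witness_f` and the example queries are hypothetical.

```python
def witness_f(q1, q2, states):
    """Return a table f with q1[s] == f[q2[s]] for all listed states,
    or None if q2 does not determine q1 on these states."""
    f = {}
    for s in states:
        key, val = q2(s), q1(s)
        # Two states with equal q2-results must have equal q1-results.
        if f.setdefault(key, val) != val:
            return None
    return f

# Hypothetical queries over states given as frozensets of (a, b) tuples.
def whole(s):
    return s                                      # returns the full relation

def big_b(s):
    return frozenset(t for t in s if t[1] > 1)    # selection b > 1

s1 = frozenset({(1, 1), (2, 2)})
s2 = frozenset({(3, 1), (2, 2)})

print(witness_f(big_b, whole, [s1, s2]) is not None)  # prints True
print(witness_f(whole, big_b, [s1, s2]))              # prints None
```

Here the full relation determines the selection (f is itself a selection), whereas the selection does not determine the full relation, because s1 and s2 agree on the selected tuples but differ elsewhere.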

The complexity of the query containment problem for different query 
languages, such as conjunctive queries, queries with negation, Datalog, etc. 
is summarized in (Ullman 1997). For example, containment for conjunctive 
queries is NP-complete, whereas containment for Datalog programs is unde- 
cidable. 

GAV, LAV, and GLAV Mappings. In a data integration setting, the relati- 
onship between the local schemas and the global schema may be expressed 
using the so-called LAV, GAV, and GLAV mappings, which are distinguished 
based on the functional properties of the mappings (compare Table 10.1 on 
page 165): 

— Global-as-view (GAV): there is a view that defines the content of the global 
schema based on the content of the sources, as in data warehousing. 

— Local-as-view (LAV): for each local schema L, there is a view on the global 
schema that defines the content of L, as in view integration. 

— Global-local-as-view (GLAV): the relationship between the local schemas 
and the global schema is established using a combination of GAV and LAV 
assertions, i.e., the mappings between the local schemas and the global
schema and their inverse mappings may be non-functional. This is the general
case in database integration.

Distinguishing between GAV, LAV, and GLAV mappings led to different 
query rewriting algorithms with varying computational properties. 

Sound, Complete, and Exact Views. Another distinction often made in the 
literature is that between sound, complete, and exact views (Lenzerini 2002). 
Strictly speaking, this terminology finesses the fact that different, non-
functional mapping languages are used to characterize the relationship bet-
ween the global schema and the local schemas. To illustrate, let $v$ be a view
on $m$, i.e., $v$ is a total functional mapping from $m$ to $m'$. Assume that we
are given a valid “snapshot” of the application state, i.e., we have an in-
stance $x \in m$ of the source schema and an instance $y \in m'$ of the view.
The view $v$ is said to be exact when for all such snapshots $v[x] = y$, sound
when $v[x] \supseteq_i y$, and complete when $v[x] \subseteq_i y$. The relationship $\subseteq_i$ describes
containment of instances, as explained above. (The case $v[x] \supseteq_i y$ is referred
to as “open-world assumption” in (Abiteboul and Duschka 1998).)

Technically, each of the above “views” can be represented as a mapping
$map$ defined as $map = v$ for exact views,
$(x, y) \in map \Leftrightarrow v[x] \supseteq_i y$ for sound views, and
$(x, y) \in map \Leftrightarrow v[x] \subseteq_i y$ for complete views. For
example, let $v$ = «$\pi_a(R) = S$». A mapping that describes $v$ as a sound view
is $map$ = «$\pi_a(R) \supseteq S$», i.e., the mapping holds if and only if all tuples of $S$
are contained in $\pi_a(R)$. This mapping is non-functional.
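
The three cases can be spelled out for the projection example. The sketch below classifies a single snapshot (x, y) under set semantics; note that the definitions in the text quantify over all valid snapshots, so a per-snapshot check is only a necessary condition. The names `project_a` and `classify` are hypothetical.

```python
def project_a(r):
    """The view v: projection of relation R(a, b) on attribute a."""
    return frozenset(a for (a, _b) in r)

def classify(view, x, y):
    """Return the labels that the snapshot (x, y) is consistent with."""
    vx = view(x)
    labels = set()
    if vx == y:
        labels.add("exact")
    if vx >= y:            # all tuples of y are contained in v[x]
        labels.add("sound")
    if vx <= y:            # all tuples of v[x] are contained in y
        labels.add("complete")
    return labels

R = frozenset({(1, "p"), (2, "q")})

print(classify(project_a, R, frozenset({1})))        # prints {'sound'}
print(classify(project_a, R, frozenset({1, 2, 3})))  # prints {'complete'}
```

For a fixed R, several different instances S satisfy the sound-view condition, which illustrates why the corresponding mapping is non-functional.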

In this section, we have shown that several well-known hard problems can 
be characterized relatively easily using the state-based approach. What makes 
these problems hard is the need to express the mappings and models whose 
properties are specified in an abstract fashion using concrete languages. This 
is the objective of materialization (see Sect. 4.3). Materialization seems to be 
one of the major upcoming challenges for model management. Fortunately, 
there is excellent prior work to build on. 



10.2 Schema Matching and Match 

Similarity Flooding. In the model-management prototype that we presented 
in this dissertation, the operator Match is implemented using the Similarity 
Flooding (SF) algorithm of Chap. 7. In designing the SF algorithm and the 
filters, we borrowed ideas from three research areas. The fixpoint computation 
corresponds to random walks over graphs (Motwani and Raghavan 1995), as 
explained in Sect. 7.4. A well-known example of using fixpoint computation 
for ranking nodes in graphs is the PageRank algorithm used in the Google 
search engine (Brin and Page 1998). Unlike PageRank, our algorithm has two 
source graphs and extensively uses and depends on edge labeling. The filters 
that we proposed for choosing subsets of multimappings are based on the
intuition behind the class of stable marriage problems (Gusfield and Irving
1989). General matching theory and algorithms are comprehensively covered
in (Lovász and Plummer 1986).

Since the SF algorithm was published (Melnik et al. 2002), a number 
of other research efforts that exploit a similar idea have emerged in the li- 
terature. For example, Jeh and Widom (2002) examine the case where all 
arcs bear the same label and represent document citation links. Anyanwu
and Sheth (2002) use the intuition of contextual similarity to discover asso- 
ciations on the Semantic Web. Noy and Musen (2002) present an algorithm 
for comparing ontology versions, which uses fixpoint computation. Goldstone 
and Rogosky (2002) exploit unlabeled relations in graphs to translate across 
conceptual systems using the intuition of contextual similarity that underlies 
the SF algorithm. They draw very intriguing conclusions regarding cognitive 
processes that take place in establishing relationships between different con- 
ceptual models. In particular, their work supports the claim made by some 
philosophers and cognitive scientists that it is often possible to translate bet-
ween the concepts of two conceptual systems by exploiting only intrinsic,
within-system relations and no or little extrinsic grounding. 

We investigated the application of the SF algorithm for schema matching. 
However, SF is a general-purpose graph matching algorithm and graph mat- 
ching has many other applications. As we observed in Chap. 7, the algorithm 
may be utilized for computing schema correspondences using instance data, 
and for finding related elements in data instances. In fact, matching of data 
instances is a promising application area. For example, consider two CAD files 
or program scripts that have been independently modified by several develo- 
pers. In this scenario, matching helps to identify moved or modified elements 
in these complex data structures. In bioinformatics, matching has been used
for network analysis of molecular interactions (Ogata et al. 2000; Kanehisa
2000). In this domain, data instances represent, e.g., metabolic networks of
chemical compounds, or molecular assembly maps. Matching of molecular
networks and biochemical pathways may help predict metabolism of an or-
ganism given its genome sequence. We are aware of ongoing work by other
researchers who are applying a variation of the SF algorithm to data struc-
tures used in computer graphics, semantic integration of spatial data, and
bioinformatics.

Related Approaches. The SF algorithm is only one possible implementation 
for the Match operator. Various systems and approaches have recently been 
developed to determine mappings between schemas (semi-)automatically,
e.g., Autoplex (Berlin and Motro 2001), Automatch (Berlin and Motro 2002),
Clio (Yan et al. 2001; Miller et al. 2001), COMA (Do and Rahm 2002), Cu-
pid (Madhavan et al. 2001), Delta (Clifton et al. 1997), DIKE (Palopoli et al.
2003), EJX (Embley et al. 2001), GLUE (Doan et al. 2002), LSD (Doan et al.
2001), MOMIS (and ARTEMIS) (Bergamaschi et al. 2001; Castano and An-
tonellis 1999), SemInt (Li and Clifton 2000), SKAT (Mitra et al. 1999), and
TranScm (Milo and Zohar 1998). While most of them have emerged from the
context of a specific application, a few approaches (Clio, COMA, Cupid) try,
just as SF, to address the schema matching problem in a generic way that is
suitable for different applications and schema languages. A taxonomy of auto-
matic match techniques and a comparison of the match approaches followed
by the various systems is provided in (Rahm and Bernstein 2001). Accor-
ding to their taxonomy, Similarity Flooding can be classified as a structural,
1:1-local, m:n-global matching algorithm.

To identify a solution for a particular match problem, it is important to 
understand which of the proposed techniques performs best, i.e., can reduce
the manual work required for the match task at hand most effectively. To
show the effectiveness of their system, the authors have usually demonstrated
its application to some real-world scenarios or conducted a study using a
range of schema matching tasks. Unfortunately, the system evaluations were
done using diverse methodologies, metrics, and data, making it difficult to
assess the effectiveness of each single system, not to mention to compare
their effectiveness. Furthermore, the systems are usually not publicly available,
making it virtually impossible to apply them to a common test problem or
benchmark in order to obtain a direct quantitative comparison. 

To obtain a better overview of the current state of the art in evaluating 
schema matching approaches, we reviewed the recently published evaluations 
of the schema matching systems in (Do et al. 2002). There, we introduced and 
discussed the major criteria influencing the effectiveness of a schema matching 
approach, e.g., the chosen test problems, the design of the experiments, the 
metrics used to quantify the match quality and the amount of saved manual 
effort. Apart from the Cupid evaluation, which represents the first ever effort 
to evaluate multiple systems on uniform test problems, the problems used in 
other approaches originate from very different domains of varying complexity. 
While some evaluations used simple match tasks with small schemas and few 
correspondences to be identified, several systems also showed high match qua-
lity for somewhat more complex real-world schemas (COMA, LSD, GLUE,
SemInt). Some evaluations, such as those of Autoplex and Automatch, completely lack
the description of their test schemas. Unlike other systems, Autoplex, Au-
tomatch and LSD perform matching against a previously constructed global
schema. All systems return correspondences at the element level with simi-
larity values in the range of [0,1]. Except for SemInt, correspondences are
of 1:1-local cardinality (using the taxonomy of (Rahm and Bernstein 2001)),
providing a common basis for determining match quality. 

As we have discussed in Chap. 9, matching is a subjective operation and 
there is not always a unique result. The evaluation that we presented in 
this dissertation is the only matching evaluation we are aware of that took 
into account the subjectivity of the user perception about required match 
correspondences. The schemas utilized in the user study are in Appendix A.
Previously proposed metrics for measuring the matching accuracy (Li
and Clifton 2000; Doan et al. 2001) did not consider the extra work caused 
by wrong match proposals. Our quality metric, matching accuracy, which 
we used for evaluating the SF algorithm is related to the precision/recall 
metrics developed in the context of information retrieval. It has been used in 
subsequent work by Do and Rahm (2002) under the name “Overall” metric.
Our metric is similar in spirit to measuring the length of edit scripts as
suggested in (Chawathe and García-Molina 1997). However, we are counting
the edit operations on mappings, rather than those performed on models 
to be matched. Another source of extra work is additional preparing and 
training effort. SF and SemInt do not require any such pre-match effort,
unlike other approaches such as the use of neural networks (Li and Clifton 
2000) or machine learning techniques (Doan et al. 2001). 
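
The idea of counting edit operations on mappings can be sketched as follows. We assume here, following our recollection of the published SF evaluation (Melnik et al. 2002), that accuracy is one minus the number of deletions and additions needed to turn the proposed mapping into the intended one, relative to the size of the intended mapping; the definition given with the evaluation itself takes precedence over this sketch. The function name and the example pairs are hypothetical.

```python
def matching_accuracy(proposed, intended):
    """1 - (wrong proposals to delete + missed pairs to add) / |intended|."""
    proposed, intended = set(proposed), set(intended)
    deletions = len(proposed - intended)   # false positives the user removes
    additions = len(intended - proposed)   # missed matches the user adds
    return 1 - (deletions + additions) / len(intended)

# Hypothetical correspondences between elements of two schemas.
intended = {("a", "A"), ("b", "B"), ("c", "C"), ("d", "D")}
proposed = {("a", "A"), ("b", "B"), ("c", "X")}

print(matching_accuracy(proposed, intended))  # prints 0.25
```

Under this assumption the value equals Recall · (2 − 1/Precision) whenever at least one proposed pair is correct, and it becomes negative when fewer than half of the proposals are correct, i.e., when fixing the proposals costs more than matching from scratch.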

Open Issues and Promising Techniques. Schema matching is an AI-complete
problem, i.e., it is as hard as simulating human intelligence. Thus, there is
little hope that we will be able to automate it fully anytime soon. Although 
semiautomatic techniques can be quite useful in many scenarios, in many 
other applications automated schema matching is less effective and matching 
can rather be compared to a complex design task. Development of powerful 
GUI tools is essential to support such design tasks. 

One of the hardest open problems in schema matching is the computa- 
tion of mapping expressions that describe value transformations and struc- 
tural manipulations of data, such as aggregation or transposition. Mapping 
expressions are needed to make the mappings operational, i.e., to map in- 
stances of schemas from one representation into another. They can be used 
to generate SQL views, XSL transformations, or Java programs that can be 
directly executed. Specifying mapping expressions is typically a much more 
expensive step as compared to finding schema correspondences. In addition, 
it is very hard to automate. With the exception of Clio (Miller et al. 2000; 
Miller et al. 2001), an overwhelming majority of schema matching approa- 
ches focus on determining schema correspondences. In Clio, after the element 
correspondences have been uncovered, the engineer is presented with an exhaustive
list of possibilities of computing the joins over the source schema and can 
select the desired ones by examining the data samples that are judiciously 
chosen by the system. The formulas describing value transformations have to 
be entered manually by the engineer. The recent work by Brown and Haas 
(2003) may be helpful for identifying such formulas. 

One of the most promising techniques in schema matching is reuse of exi- 
sting mappings by composition. The effectiveness of this approach for finding 
schema correspondences was first studied in (Do and Rahm 2002). A ma- 
jor benefit of reuse is that the technique could also be deployed to compute 
mapping expressions. To the best of our knowledge, reuse of mapping expressions
has not been addressed in the literature yet. We illustrate and define the 
problem of reuse by composition at the end of this section. Mapping adap-
tation, which can be seen as a variant of reuse, is considered in (Velegrakis
et al. 2003). We discuss that approach in Sect. 10.8.2. The problem of how
the schema element labels, which are reused across different websites, can be
utilized to drive the schema matching task between n schemas is examined
in (He and Chang 2003). Their work considers the schema matching problem
in the context of integrating the so-called Hidden Web databases. 

Exploiting instance data offers another range of promising schema mat-
ching techniques. In addition to Bayesian learners (LSD/GLUE) or neural
networks (SemInt), schema induction can be utilized to derive a more detai-
led description of the underlying data. This approach is especially valuable if
the schemas are generic, such as entity-attribute-value tables that store hete-
rogeneous information using just a few tables (Agrawal et al. 2001b), or when
schemas are missing entirely, e.g., in semistructured databases. Data mining 
techniques such as those presented in (Nestorov et al. 1998) can help derive 
schemas from instance data. A recent work exploiting statistical correlation 
of instance data is presented in (Kang and Naughton 2003). 

Reuse in Schema Matching. We illustrate and formalize the problem of map- 
ping reuse. The key ideas that we exploit were presented in (Rahm and Bern- 
stein 2001; Do and Rahm 2002; Halevy et al. 2003; Kementsietsidis et al. 
2003). Informally, the problem of mapping reuse can be stated as follows: 
given a repository of schemas and mappings, derive the mapping between 
two given schemas using the mappings contained in the repository. Below, 
we specify the problem using two operators, composition and confluence. 

Do and Rahm (2002) observed that given two mappings $m_1\_m_2$ and
$m_2\_m_3$, we can compute the mapping between $m_1$ and $m_3$ by composing
the two mappings. They suggested the use of composition as a heuristic and
argued that composition may yield incorrect results when the transitivity
assumption of element correspondences does not hold. Their conclusion, ho-
wever, describes a property of the mapping language they used, and not a
property of the composition operation.

The result of composition $m_1\_m_2 \circ m_2\_m_3$ defines a correct mapping bet-
ween $m_1$ and $m_3$, although this mapping is typically incomplete. In general,
any $k$-way composition “path” that connects $m_p$ and $m_q$ describes a partial
mapping between $m_p$ and $m_q$. A more precise mapping between $m_p$ and $m_q$
can be obtained by aggregating several partial mappings using the Confluence
operator. To illustrate, consider the following example.

Example 10.2.1. Assume that the following schemas and mappings are con-
tained in the repository:

$m_1$ = «$R_1, S_1, T_1$»
$m_2$ = «$R_2, S_2$»
$m_3$ = «$T_3$»
$m_4$ = «$R_4, S_4, T_4$»

$m_1\_m_2$ = «$R_1 = R_2, S_1 \subseteq S_2$»
$m_2\_m_4$ = «$R_2 = R_4, S_4 \subseteq S_2$»
$m_1\_m_3$ = «$T_1 \subseteq T_3$»
$m_3\_m_4$ = «$T_3 \subseteq T_4$»

For example, mapping $m_1\_m_2$ states that relations $R_1$ and $R_2$ have the same
extensions, while the tuples of $S_1$ are a subset of tuples of $S_2$. The goal is to
infer the “best” mapping between $m_1$ and $m_4$ based on the information in
the repository. Observe that one path between $m_1$ and $m_4$ can be obtained
via $m_2$ as

$m_1\_m_2 \circ m_2\_m_4$ = «$R_1 = R_4$»

The composition over $m_3$ yields another partial mapping

$m_1\_m_3 \circ m_3\_m_4$ = «$T_1 \subseteq T_4$»

The confluence of both composition paths yields a “maximal” mapping bet-
ween $m_1$ and $m_4$ derivable from the repository

$map = (m_1\_m_2 \circ m_2\_m_4) \oplus (m_1\_m_3 \circ m_3\_m_4)$
= «$R_1 = R_4, T_1 \subseteq T_4$» ■
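
The example can be replayed with a toy encoding of the mapping language. The sketch below is an illustration under stated assumptions, not the implementation of the operators in the prototype: a mapping is a set of assertions (left, op, right) with op being "=", "<=" (left contained in right), or ">=" (left contains right); `compose` chains assertions that share a middle element, and `confluence` merges the assertions of several partial mappings between the same pair of schemas. All function names are hypothetical.

```python
def combine(op1, op2):
    """Operator derivable from 'a op1 b' and 'b op2 c', or None."""
    if op1 == "=":
        return op2
    if op2 == "=":
        return op1
    return op1 if op1 == op2 else None  # "<=" with ">=" yields nothing

def compose(m_ab, m_bc):
    """Derive assertions between a-elements and c-elements via b-elements."""
    out = set()
    for (a, op1, b) in m_ab:
        for (b2, op2, c) in m_bc:
            op = combine(op1, op2) if b == b2 else None
            if op is not None:
                out.add((a, op, c))
    return out

def confluence(*mappings):
    """Merge partial mappings; containment both ways becomes equality."""
    by_pair = {}
    for m in mappings:
        for (a, op, c) in m:
            by_pair.setdefault((a, c), set()).add(op)
    out = set()
    for (a, c), ops in by_pair.items():
        if "=" in ops or {"<=", ">="} <= ops:
            out.add((a, "=", c))
        else:
            (op,) = ops
            out.add((a, op, c))
    return out

# The repository of Example 10.2.1 (S4 contained in S2 is written S2 >= S4).
m1_m2 = {("R1", "=", "R2"), ("S1", "<=", "S2")}
m2_m4 = {("R2", "=", "R4"), ("S2", ">=", "S4")}
m1_m3 = {("T1", "<=", "T3")}
m3_m4 = {("T3", "<=", "T4")}

m1_m4 = confluence(compose(m1_m2, m2_m4), compose(m1_m3, m3_m4))
print(sorted(m1_m4))  # prints [('R1', '=', 'R4'), ('T1', '<=', 'T4')]
```

No assertion between S1 and S4 is derived, because both relations are only known to be contained in S2, which says nothing about how they relate to each other.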

Formally, we state the reuse problem as follows: given a repository of
schemas and mappings, infer a maximal mapping between two given schemas
$m_p$ and $m_q$, defined as

$m_p\_m_q = \bigoplus_i (map_{i_1} \circ \cdots \circ map_{i_k})$

where $\mathit{Domain}(map_{i_1}) \subseteq m_p$ and $\mathit{Range}(map_{i_k}) \subseteq m_q$.

Notice that $m_p\_m_q$ is in general still a partial mapping. For instance, in
Example 10.2.1 it may well be the case that the exact mapping between $m_1$
and $m_4$ is «$R_1 = R_4, T_1 = T_4$». The equality $T_1 = T_4$ may not be derivable
from the repository, but only a weaker condition $T_1 \subseteq T_4$. The maximal
inferable mapping may still require post-processing by a human engineer,
since it is guaranteed to be correct albeit not complete.

The reuse scenario presents a number of computational challenges. Exam- 
ples are avoiding the computation of redundant composition paths or finding 
the best order of executing the composition, similar to finding an optimal 
query plan for multiple joins. Also, some mappings stored in the repository 
may have been obtained using model management scripts, in which case it 
may be possible to optimize the computation of the maximal inferable map- 
ping even further. 



10.3 Mapping Composition and Compose 

Composition is a fundamental algebraic operation. It plays a key role in many 
different areas of mathematics and theoretical computer science and is consi-
dered a basic axiomatic abstraction in category theory (compare Sect. 10.6.3).
In database research, mapping composition has been primarily considered
for queries, i.e., functional mappings. For example, answering a query sta-
ted against a virtual, non-materialized view amounts to composing the query
with the view definition. In relational algebra, each relational operator such
as projection or selection describes a database transformation. Thus, a for-
mula expressed in relational algebra, such as $\pi_a(\sigma_{b=5}(R \bowtie S))$, can be viewed
as a composition of several elementary transformations. Composition of data
transformations was considered as early as in (Shu et al. 1977; Paolini and 
Pelagatti 1977; Borkin 1978; Bancilhon and Spyratos 1981). Commutativity 
of elementary transformations provides a foundation for query optimization. 
Abiteboul et al. (1995) note that relational algebra queries are closed under 
composition, i.e., the result of composing two relational algebra queries can 
always be expressed as another relational algebra query. 
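
Reading a relational algebra formula as a composition of elementary transformations can be made concrete with a small sketch. Relations are modeled as lists of Python dicts, each operator is a function from one database state to the next, and `compose` applies the steps from left to right, matching the composition order used in this chapter. The helper names are hypothetical and the operators are simplified (for instance, projection returns a set of value tuples).

```python
def natural_join(r, s):
    """Join two relations on all attributes they have in common."""
    out = []
    for t in r:
        for u in s:
            if all(t[k] == u[k] for k in t.keys() & u.keys()):
                out.append({**t, **u})
    return out

def select(pred):
    """Selection: keep the rows that satisfy the predicate."""
    return lambda rows: [t for t in rows if pred(t)]

def project(*attrs):
    """Projection: keep the listed attributes and drop duplicates."""
    return lambda rows: {tuple(t[a] for a in attrs) for t in rows}

def compose(*steps):
    """Left-to-right composition of transformations."""
    def run(x):
        for step in steps:
            x = step(x)
        return x
    return run

# The formula pi_a(sigma_{b=5}(R join S)) as a composition of three steps.
query = compose(
    lambda db: natural_join(db["R"], db["S"]),
    select(lambda t: t["b"] == 5),
    project("a"),
)

db = {
    "R": [{"a": 1, "c": 10}, {"a": 2, "c": 20}],
    "S": [{"c": 10, "b": 5}, {"c": 20, "b": 7}],
}
print(query(db))  # prints {(1,)}
```

Because each step is an ordinary function, composing two such queries is again a function of the same kind, which mirrors the closure property noted above.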

Composition is likely to be the most frequently used operation in model- 
management scripts. Composition implemented in Rondo respects the state- 
based semantics of the operator suggested in Sect. 4.2.1. Although our imple- 
mentation focuses on morphisms, composition is a truly generic operation, 
which has been utilized in many scenarios and for various kinds of mapping 
languages. For example, in GUI tools composition of the operations recor- 
ded in the undo history can be viewed as a compensating transformation for 
user editing operations. In database recovery, the effects of multiple updates 
are composed to speed up and ensure the correctness of recovery. In (Spac- 
capietra and Parent 1994; Rosenthal and Reiner 1994; Atzeni and Torlone 
1996) composition is performed on schema transformations rather than on 
instance transformations and is exploited to study soundness and completen- 
ess of rewriting rules. Bernstein and Rahm (2000) illustrate how composition 
can be exploited in data warehousing scenarios. 

A significant application of composition, which has gained importance 
with the advent of XML, XQuery, and XSLT, consists in composing XQuery
or XSLT queries over XML views on relational data with the views and pus-
hing the result down for query optimization in the relational engine. As reported in
(Fernandez et al. 2002), three commercial XML publishing systems, Oracle 
XML SQL Utility, IBM DB2 XML Extender, and Microsoft SQL Server, 
support such query composition. Although querying XML publishing views 
can be considered a query optimization problem, the work on this applica- 
tion produced algorithms that could be utilized for implementing composi- 
tion (and decomposition) of expressions in different mapping languages in a 
model-management system. 

We explain the general principles behind the aforementioned application 
using the work (Shanmugasundaram et al. 2001a) as an example. Their ap- 
proach is illustrated in Fig. 10.1 (presentation in the figure differs from that 
used in the paper). There, a publishing view ($v_{def} \circ v_{user}$) over relational data
is defined using a composition of a so-called default XML view $v_{def}$ and a
user-defined view $v_{user}$ expressed in XQuery. The queries $q$ are run against
such publishing views. To push down the queries, the following approach is
used. First, a query is composed with the publishing view. That is, a 3-way
composition of the query with a user view and a default view is computed
as $v_{def} \circ v_{user} \circ q$. The result of this 3-way composition is represented
internally as a so-called XML Query Graph Model, $q_{XQGM}$. Then, this internal
mapping is decomposed into two transformations: a SQL query $q_{SQL}$ and a
so-called tagger graph $v_{tag}$ (expression in yet another language that
generates an XML document from the query results delivered by the relational
engine). The purpose of this decomposition is to push down data and me- 
mory intensive computation to the underlying relational engine. Thus, a user 
query is processed by creating an intermediate mapping using a 3-way com- 
position and then translating it into an equivalent mapping expressed as a 
2-way composition. The examined subset of the XQuery language supports 
nested expressions and nested order, while the presented view composition 
technique is shown to be complete and produce minimal SQL queries. Fern- 
andez et al. (2002) present a very similar approach (there, default views are 
called canonical views). 



Fig. 10.1. Use of composition in (Shanmugasundaram et al. 2001a). [Diagram labels: user query $q$; user view $v_{user}$ (XQuery); default view $v_{def}$ (XML); $q_{XQGM} = v_{def} \circ v_{user} \circ q$; SQL query $q_{SQL}$; RDB schema; result (relational); tagger; result (XML).]



More recently, Li et al. (2003) investigated composition of XSL trans- 
formations, instead of XQuery transformations, with XML publishing views. 
Composition of transformations on semistructured data was explored in (Pa- 
pakonstantinou and Vassalos 1999). 

Madhavan and Halevy (2003) depart from composing purely functional 
mappings and study composition for a GLAV language, which combines the 
global-as-view (GAV) and local-as-view (LAV) formalisms. In the mapping 
language they consider, a mapping is expressed as a set of formulas of the kind 
$Q_a \subseteq Q_b$, where $Q_a$ and $Q_b$ are conjunctive queries over two schemas. They
show that in this setting the mapping produced as result of composition may 
not be representable using a finite expression. However, the composition turns 
out to be finite for a useful subset of the considered language. The authors 
present an algorithm for obtaining a minimal composition and establish its 
complexity bounds. 




The works (Bernstein and Rahm 2000; Bernstein 2003) give a structural
definition of composition, which assumes that mappings are represented as di- 
rected acyclic graphs. They distinguish several variants of composition, such 
as left/right and outer composition. It is difficult to assess how their definiti- 
ons relate to the state-based composition (Definition 4.2.1), since the focus of 
their work is on structural semantics and no instance-level characterization 
of the mapping language that they deploy is presented. 



10.4 View Selection and Extract 

The intuition that we exploit in defining the operator Extract is closely related 
to the problem of view selection in data warehouse design. The view selection 
problem can be stated as follows (Theodoratos et al. 2001; Chirkova et al. 
2001): given a set of workload queries over a database schema, select a view 
to materialize in the data warehouse such that: 

1. All queries can be answered using the materialized view, 

2. The warehouse design is optimal with respect to a certain cost metric 
(e.g., combination of query evaluation and maintenance cost), 

3. All operational constraints are satisfied (e.g., the warehouse fits into the 
available storage space). 

If the set of queries to be supported is limited to one query, condition (1) 
can be stated using the operator Extract. More precisely, it corresponds to the 
materialization of Extract (compare Sect. 4.3), since it lacks the minimality 
constraint. Conditions (2)-(3) can be seen as tuning knobs that drive the
materialization. In fact, condition (1) has been considered in more depth in 
the context of answering queries using views (see Sect. 10.1.2), while the 
warehouse design literature focused primarily on practicability of the design, 
i.e., conditions (2)-(3). 

Various kinds of algorithms and query rewriting techniques have been sug- 
gested for view selection, including randomized algorithms and heuristic ap- 
proaches. In (Theodoratos et al. 2001), queries are combined into a so-called 
multiquery graph. Multiquery graphs are rewritten using a set of transfor- 
mation rules, such that each rule preserves condition (1), i.e., is sound. The 
authors also prove completeness of the presented rules. Chen et al. (2002) 
introduce another data structure, a so-called merging tree that is used to 
combine the candidate views derived from the workload. The algorithms for 
obtaining minimal views were presented in (Li et al. 2001). 

Important theoretical results were presented by Chirkova et al. (2001).
In particular, they study the cardinality of resulting view configurations and 
establish lower and upper complexity bounds for the problem. Many, if not 
most, approaches to view selection focus on select-project-join queries for rela-
tional schemas under set semantics. Multiset semantics and group-by queries
are considered in (Agrawal et al. 2001a; Chen et al. 2002). In a more recent 
work, Gupta et al. (2003) addressed the view selection problem for XML 
schemas in content-based routing. 

Although using materialized views for query optimization is a relatively 
old idea, it has only recently been adopted in commercial database systems. 
Agrawal et al. (2001a) developed a tool that recommends materialized views 
and indexes and examines tradeoffs between using indexes and materialized 
views for a given query workload. The tool ships with Microsoft SQL Server
2000.

A major source of complexity of the view selection problem originates 
from the fact that workload queries can interact in various ways. The operator 
Extract that we presented takes only one mapping as input. Although it would 
be easy to extend the definition of the operator for n input mappings, we 
conjecture that the n-ary case can be expressed by a combination of existing 
operators. We consider this aspect in more detail in Sect. 11.3.3. 



10.5 View Complement and Diff 

The operator Diff generalizes the notion of a view complement studied exten- 
sively in the context of data warehousing. The notion of view complement 
was introduced in the groundbreaking work by Bancilhon and Spyratos (1981) 
who used it as a vehicle for a formal treatment of the view update problem 
(Dayal and Bernstein 1978). Two views are complementary if given the state 
of each view, there is a unique corresponding state of the source database. 
That is, the two views are sufficient to reconstruct the database. Recently, it 
has been shown (Laurent et al. 2001) that view complements can be exploi- 
ted to guarantee desirable data warehouse properties such as independence 
and self-maintainability (a data warehouse view is called self-maintainable if 
the updated warehouse can be computed directly given the reported changes 
in the sources without additional maintenance queries). The theoretical pro- 
perties of view complements were studied in (Keller and Ullman 1984; Hegner 
1994). The computability of view complements was examined in (Cosmada- 
kis and Papadimitriou 1984; Laurent et al. 2001; Lechtenborger and Vossen 
2003). In the remainder of this section we expand the summary given in this
paragraph. 

The view update problem consists in finding a correct translation of an 
update on the view into an update on the source database. One of the 
main correctness criteria suggested in (Dayal and Bernstein 1978) requires 
that a view update translation have no side effects on the view. Since a 
view typically does not preserve all information in the source database, such 
translation is in general non-unique. Bancilhon and Spyratos (1981) suggested 
that the desired update policy could be characterized by choosing a certain 
view complement and making sure that it remains invariant under updates.

Bancilhon and Spyratos (1981) used a state-based formal framework. In
the definition of the view complement $g$ of $f$ they require the set of pairs
of instances of the respective view schemas, $\{(f(x), g(x)) \mid x \in m\}$, to be
equipotent with the source schema $m$. Notice that this set is equipotent
with the mapping $f \circ \mathrm{Invert}(g)$. Essentially, their requirement corresponds
to condition (ii) of Definition 4.2.5. They also reformulate their definition of 
view complement similarly to what we did in Theorem 4.2.5 by requiring the 
instances of the source schema that were indistinguishable to become distin- 
guishable in the view complement (their Theorem 4.2, p. 564). The authors 
note that a view can have many different complements and that the iden- 
tity view ld(m) is a trivial complement for each view. This fact suggests that 
without some sort of minimality conditions the definition of the view comple- 
ment can always be satisfied trivially. Thus, (Cosmadakis and Papadimitriou 
1984; Laurent et al. 2001) focus on “small” or “minimal” complements. 
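
The state-based reading of complementarity can be checked mechanically on a small finite example. The following sketch assumes a hypothetical source schema whose instances are single (name, age) tuples over tiny domains; the function is_complement and the sample views are illustrative names, not part of the cited formalizations. It tests whether the pair of view states determines the source state uniquely, i.e., whether x -> (f(x), g(x)) is injective.

```python
from itertools import product


def is_complement(instances, f, g):
    """True if x -> (f(x), g(x)) is injective on instances, i.e., the two
    view states together determine the source state uniquely."""
    seen = {}
    for x in instances:
        key = (f(x), g(x))
        if key in seen and seen[key] != x:
            return False
        seen[key] = x
    return True


# Hypothetical source schema m: one (name, age) tuple over tiny domains.
m = list(product(["ann", "bob"], [30, 40]))


def view_name(x):
    """View f: projection on name."""
    return x[0]


def view_age(x):
    """Candidate complement g: projection on age."""
    return x[1]


def identity(x):
    """The identity view, which is always a (trivial) complement."""
    return x


ok_age = is_complement(m, view_name, view_age)     # name and age suffice
ok_id = is_complement(m, view_name, identity)      # trivial complement
ok_self = is_complement(m, view_name, view_name)   # a view paired with itself
```

In this toy setting the projection on age and the identity view both pass the check, whereas pairing the name view with itself does not, because two source states with the same name but different ages become indistinguishable. The example also shows why a minimality condition is needed: the identity view passes the test for every view.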

Our definition of Diff (Definition 4.2.5) corresponds to that of a minimal 
view complement. As we noticed in Sect. 4.3, the minimality condition can be 
relaxed by looking for a materialization of the mapping, or view complement, 
of interest. That is, we can obtain any view complement $v_c$ from the minimal
view complement $v_{min}$ by composing $v_c$ with some function $h$ such that
$v_c \circ h = v_{min}$. Also, note that our definition of Diff and the results with respect
to the alternative formulation are more general than those in (Bancilhon and 
Spyratos 1981) in that they apply to arbitrary mappings and not only total 
surjective functions. 

Our definition and the formalization of Bancilhon and Spyratos (1981) are 
decoupled from concrete schema and view languages. The materialization of 
view complements for database schemas was first addressed by Keller
and Ullman (1984). In particular, they considered the classical relational 
schemas whose instances are subsets of finite powersets. They showed that a 
so-called monotonic view has at most one complement that is independent 
and monotonic. They also presented an illuminating way of visualizing view 
complements in a tabular form. Hegner (1994) considered a less restrictive 
setting and showed that under certain conditions the complemented views 
form a Boolean algebra. 

The algorithms for computing view complements were first studied by 
Cosmadakis and Papadimitriou (1984) for the views expressed in relational 
algebra using projection, selection, and join. They showed that for relational 
schemas with arbitrary functional dependencies computing a minimal com- 
plement is NP-complete, but did so using a peculiar notion of minimality that 
differs from the one used in our work and other approaches cited above. More 
recently, Lechtenborger and Vossen (2003) argued that it may not even be 
necessary to look for the minimal complements; instead, reasonably small yet 
non-minimal complements may be more useful in practice. The authors study
the problem for a large class of relational views that include the relational 
difference operator.

Most approaches dealing with view complements focused on relational
databases and relational views under set semantics. As we demonstrated using 
examples in Sect. 4.2.5, the results of Diff may look quite different in the
case of multiset semantics. The work (de Amo and Halfeld Ferrari Alves 2000) 
presents algorithms for computing minimal view complements for temporal 
databases by translating views that are expressed in temporal algebra to first 
order expressions over non-temporal relations. 

Stated as a view complement problem, computing the results of Diff has 
always been considered as a standalone operation. The problem becomes 
much harder if Diff is combined in scripts with other operators. As we de- 
monstrated using the change propagation scenario in Chap. 5, studying the 
properties of such complex scripts can be non-trivial. 



10.6 Approaches to Specifying Semantics 

10.6.1 Semantics of Models and Mappings 

The approach presented in Chap. 4 follows a standard way of specifying se- 
mantics used in databases and formal logic. For example, in model theory, 
which provides standard semantics for mathematical logic, the semantics of 
logical expressions is explained in terms of all possible worlds or interpre- 
tations that are consistent with these expressions. Similarly, semantics of 
database schemas is traditionally expressed in terms of all possible database 
states or instances that are consistent with the schema (Borkin 1978; Bancil- 
hon and Spyratos 1981; Atzeni et al. 1982; Abiteboul et al. 1995). In (Mad- 
havan et al. 2002), the semantics of models and mappings is characterized 
directly in terms of model-theoretic interpretations. In all these approaches, 
the semantics of a formal artifact is described by a set (or family) of other 
formal artifacts. We call this universal concept instantiation and use it as 
a foundation for defining the state-based semantics of model-management 
operators. 

In contrast to many other techniques, the notion of state-based semantics 
exploited in this dissertation is not tied to a specific data model, such as 
the relational model, or language, such as first-order logic or SQL. By
raising the level of abstraction, we characterize the semantics of operators 
in a generic fashion. That is, we consider the instances of models as opaque 
entities and characterize the operators that manipulate models and mappings 
without considering the internal structure of the instances. In principle, the 
instances of models can be treated as models themselves, i.e., can have their 
own sets of instances. This flexibility is required in the approaches such as 
(Atzeni and Torlone 1996; Cluet et al. 1998) which exploit multiple levels 
of instantiation. (The fact that instances can be treated as models might 
have contributed to the apparent overloading of the term “model” used in
the literature on databases, AI, and formal logic, which was observed in
(Madhavan et al. 2002).) 

We illustrated the operators using relational algebra and SQL views, and 
examined in more detail a very simple mapping language, morphisms. A 
multitude of other mapping languages have been utilized in the literature for 
addressing various metadata management scenarios (Bergamaschi et al. 1999; 
Bernstein 2003; Claypool 2002; Cluet et al. 1998; Davidson et al. 1995a; Fan 
et al. 2003; Halevy 2001; Kementsietsidis et al. 2003; Li et al. 2003; Madhavan 
and Halevy 2003; Melnik et al. 2003b; Mitra et al. 2000; Popa et al. 2002; 
Pottinger and Bernstein 2003; Velegrakis et al. 2003) - their number can 
probably compete only with the number of data models, or schema languages, 
developed for the same purpose. The state-based approach characterizes the 
semantics of these mapping languages in a uniform fashion as a relationship 
on instance sets. One of the earliest manifestations of this approach can be 
found in (Paolini and Pelagatti 1977), where databases were represented by 
many-sorted algebras and mappings were treated as homomorphisms. The 
authors demonstrated how update operations on databases can be treated as 
mappings. A similar technique was developed in (Maibaum 1977). Treating 
mappings as relationships between instances allows us to specify and study 
mapping containment, mapping composition, and mapping confluence for 
heterogeneous mapping languages. This capability is especially important 
in scenarios that deploy more than one mapping language (Li et al. 2003; 
Shanmugasundaram et al. 2001a). 
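
To illustrate how treating mappings as relationships between instances makes composition and containment independent of the mapping language, consider the following minimal sketch. It assumes that instances are opaque hashable values and that each mapping, whatever language it is written in, can be materialized as a set of instance pairs; the helper names and the sample data are hypothetical.

```python
from itertools import product


def from_predicate(pred, left, right):
    """Materialize a mapping given as a predicate over instance pairs."""
    return {(x, y) for x, y in product(left, right) if pred(x, y)}


def from_table(table):
    """Materialize a mapping given as a lookup table (a function)."""
    return set(table.items())


def compose(map1, map2):
    """Relational composition: pairs (x, z) linked through some y."""
    return {(x, z) for (x, y1) in map1 for (y2, z) in map2 if y1 == y2}


def contained_in(map1, map2):
    """Mapping containment: every pair of map1 is also a pair of map2."""
    return map1 <= map2


# Hypothetical instance sets of two models; instances are opaque values.
m1 = {1, 2}
m2 = {10, 20, 30}

# A mapping written as a constraint; it is not a function (1 has three images).
map_12 = from_predicate(lambda x, y: 10 * x <= y, m1, m2)

# A mapping written in a different "language", here a lookup table.
map_23 = from_table({10: "a", 20: "b", 30: "b"})

map_13 = compose(map_12, map_23)
functional_part = {(1, "a"), (2, "b")}
```

Here the composed mapping relates instance 1 to both "a" and "b", so it remains non-functional, while the functional mapping in functional_part is contained in it. Neither operation needs to know how the input mappings were originally expressed.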

Lenzerini (2002) suggested a very general way of specifying mappings as 
$Q_1 \sim Q_2$, where $\sim$ is some predicate that holds between the results of queries
$Q_1$ and $Q_2$. We notice however that this mapping specification makes use of
a third common model in which the results of $Q_1$ and $Q_2$ are represented.
Thus, effectively $Q_1 \sim Q_2$ describes a ternary mapping that holds between
three models (compare Sect. 11.3.4). 

An interesting observation is that many schema languages such as XML
Schema or almost every semantic database model, e.g., Schema Intension
Graphs (Miller et al. 1994), support quite expressive constraints and, in fact,
can be used as mapping languages: if we assume that the entities used in
schema $m$ are defined in several other schemas $m_1, \ldots, m_n$, then effectively
$m$ describes an $n$-ary mapping between the schemas $m_1, \ldots, m_n$.

The mapping languages vary substantially in their expressive power and 
some are better suited for a given purpose than others. For example, the mapping
tables of (Kementsietsidis et al. 2003), which we discuss in Sect. 10.9,
form a mapping language that addresses the needs of quite different applications
than does SQL or XQuery. Hence, there may not be a best language for
each model management scenario. If the mapping $map$ between two models
is expressed using several partial mappings $map_i$, each potentially in a different
mapping language, the confluence operator can be used to specify the
semantics of $map$ as $map = \bigoplus_i map_i$.

10.6.2 Information Capacity

State-based semantics is closely related to the work on information capacity
(Hull 1986; Miller et al. 1994). The information capacity of schema $m$ is the
cardinal number of its instance set, $|\mathrm{Inst}(m)|$ or simply $|m|$ in our simplified
notation. The relative information capacity of two schemas $m_1$ and $m_2$ refers
to the relationship that holds between $|m_1|$ and $|m_2|$, such as $\leq$ or $=$. The
relationship $\leq$ is called dominance and is characterized by the existence of a
surjective function $map$ from $m_2$ onto $m_1$.

A key question in the work on information capacity has been whether 
a given database schema is more, less, or equally expressive than another 
database schema, i.e., whether there exists a surjective or bijective function 
between $m_1$ and $m_2$. In contrast, the model-management approach focuses
on obtaining the actual mappings between $m_1$ and $m_2$ that can be deployed
by applications. Such mappings are specified by means of model-management 
scripts and may be non-functional (for example, morphisms and GLAV map- 
pings are typically non-functional). 

In (Hull 1986), four progressively more restrictive notions of dominance 
are studied (absolute, internal, generic, and query dominance). For exam- 
ple, query dominance means that the function map can be specified using a 
first-order predicate calculus expression. Still, even this restrictive notion of 
dominance turns out too liberal to accurately measure whether an underlying 
semantic connection exists between database schemas. The underlying line 
of argument is to present a transformation between two schemas that esta- 
blishes query dominance but appears semantically vacuous. In (Miller et al. 
1994), it is suggested that more restrictive notions of dominance need to be 
developed. That is, more constraints should be placed on the mappings. And 
yet, the mappings deployed in real applications can be arbitrarily complex. 
For example, some wrappers of legacy systems squeeze multiple attributes of 
structured XML messages into a single general-purpose ASCII field. 

We argue that dominance and equivalence are inadequate measures of 
semantic connection between schemas. In fact, we intentionally use the
term equipotence rather than equivalence to avoid implying such seman- 
tic connection. The fact that dominance or equipotence holds provides 
no guarantees as to whether and how the schemas are semantically related.
For example, schema $m_1$ = «R(Name: char, Age: int)» is equipotent
with $m_2$ = «S(ID: int, Name: char)», i.e., $|m_1| = |m_2|$. However,
$m_1$ and $m_2$ are semantically incommensurate; the mapping $m_1\_m_2$ =
«$\pi_{Name}(R) = \pi_{Name}(S)$», which may reflect the relationship between the
schemas in a particular application context, is not even a function.
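
The example can be replayed on a finite scale. The sketch below is an illustration only: it replaces the char and int domains by tiny finite sets (an assumption made so that all instances can be enumerated), treats an instance as a set of tuples, and uses hypothetical helper names. It counts the instances of both schemas and then checks whether the mapping that equates the Name projections is a function.

```python
from itertools import chain, combinations, product

NAMES = ("a", "b")   # assumed finite stand-in for the char domain
INTS = (0, 1)        # assumed finite stand-in for the int domain


def powerset(items):
    """All subsets of items, each as a frozenset."""
    items = list(items)
    subsets = chain.from_iterable(
        combinations(items, n) for n in range(len(items) + 1))
    return [frozenset(s) for s in subsets]


# Instances of R(Name, Age) and S(ID, Name) under set semantics.
inst_m1 = powerset(product(NAMES, INTS))
inst_m2 = powerset(product(INTS, NAMES))


def names_r(r):
    """Projection of an instance of R on Name."""
    return frozenset(name for name, _age in r)


def names_s(s):
    """Projection of an instance of S on Name."""
    return frozenset(name for _id, name in s)


# The mapping: pairs of instances that agree on their Name projection.
m1_m2 = {(r, s) for r in inst_m1 for s in inst_m2
         if names_r(r) == names_s(s)}

equipotent = len(inst_m1) == len(inst_m2)

# Collect the images of every instance of R to test functionality.
images = {}
for r, s in m1_m2:
    images.setdefault(r, set()).add(s)
is_function = all(len(targets) == 1 for targets in images.values())
```

With these domains both schemas have the same number of instances, so equipotence holds, yet a single instance of R that mentions one name is related to several instances of S that mention the same name with different IDs. The cardinalities alone therefore say nothing about the mapping that an application actually needs.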

In fact, it seems that most schemas used in practice are related in an 
inherently non-functional way. Frequently, it is simply irrelevant whether do- 
minance or equipotence holds. Part of the reason for that is that schemas 
rarely specify all valid application constraints because they were not known 
at the time of schema design or are not expressible in the schema constraint
language. That is, the schemas often allow “irrelevant” instances, i.e., states
that can never be reached due to application constraints. The presence of 
such instances makes the transformations on schemas non-functional. From 
this perspective, the undecidability results of (Miller et al. 1994) concerning 
equipotence may look less discouraging. Still, they provide an illuminating 
insight into the nature of difficulties that may have to be addressed in the
state-based approach to model management. Another instructive result, presented
in (Hull 1986), is that equipotence of relational schemas without constraints 
implies that the schemas must be structurally identical (up to renaming and 
isomorphism). This result may generalize to other schema and mapping lan- 
guages. 

Considering the information capacity of schemas alone is insufficient for 
defining the state-based semantics of model-management operators precisely. 
For example, in Sect. 2.3.5 we require that the schema produced by the 
Merge operator be at least as expressive as each of the input schemas. In 
Sect. 2.3.3 we argue that the schema delivered by Extract must be at most as 
expressive as the input schema. However, these requirements do not specify 
what the operators do to the instances of schemas: the relationship between 
the instances of the output schemas and the instances of the input schemas 
remains unspecified. In Chap. 4, we define this relationship precisely for all 
key operators. Further limitations of the information capacity approach are 
studied in (Davidson et al. 1995a). 

Our viewpoint is that the semantic relationship between two schemas 
is determined by the mappings that hold between them (as we argue in 
Sect. 11.3.4, the mappings may be n-ary and may involve other schemas). 
Such mappings may or may not be functions and may be arbitrarily complex. 
In Chap. 4, for all key operators we state the conditions on the mappings 
between the output schemas and the input schemas. Combining the operators 
into scripts helps us establish precise criteria on mappings produced in various 
model-management scenarios. 

10.6.3 Category Theory 

A principle underlying our work is that the essence of formal artifacts, such 
as models and instances, is to be sought primarily in the nature of their re- 
lationships with other artifacts of the same kind rather than in their internal 
constitution. This idea has achieved its fullest expression in category theory 
(Mac Lane 1998), an axiomatic framework within which the notions of trans- 
formation (as morphism or arrow), composition, and structure (as object) 
are fundamental, i.e., are not defined in terms of anything else. Note that the 
notion of morphism in category theory, which we consider in this section, is 
not to be confused with a concrete mapping language discussed in Sect. 6.1. 

One of the first applications of category theory to data management was 
studied by Maibaum (1977). The author considered database states of a specific
database schema as a category, in which the updates transforming one
state into another are treated as morphisms between the objects in the category.

The first category-theoretic approach to semantics in model management 
has been investigated in (Alagic and Bernstein 2001). There, signatures of 
database schemas form a category Sig, while mappings between schemas 
correspond to morphisms. The relationship between schemas and instances 
is captured by a functor Db (a functor is a morphism between categories), 
which corresponds to our instantiation function Inst. A minor terminologi- 
cal difference is that Db maps each database signature Sig to a category of 
instances of Sig, whereas Inst maps a model to a set of instances. Just as 
all signatures form the category Sig, all categories of instances form ano- 
ther category, which we henceforth denote as Inst. In contrast to (Alagic 
and Bernstein 2001), the formalization presented in Chap. 4 unfolds in the 
category Inst rather than in Sig.

The approach in (Alagic and Bernstein 2001) distinguishes between sche- 
mas and schema signatures. Schemas are schema signatures with associated 
integrity constraints. The authors define the category of schemas Sch, which 
differs from Sig by adding integrity constraints, with the understanding that 
all instances of schemas are guaranteed to satisfy the integrity constraints de- 
fined on the schemas. Explicit treatment of constraints makes their discussion 
somewhat verbose, since the constraints surface in all definitions and theo- 
rems. In their case, this was necessary because their main results concerned 
the mapping of constraints across morphisms. In our approach, constraints 
are an integral part of the schema language, i.e., each set of instances of
a model (i.e., each category in Inst) is guaranteed to satisfy all integrity 
constraints. 

In the definition of a category, objects are “just things” for which no 
internal structure is observable by categorical means (composition, identities, 
morphisms, and typing). A key challenge in working with a category of objects 
is to develop a set of axioms that characterize the relationships between the 
objects of the category as precisely as possible. In (Alagic and Bernstein 
2001), the authors characterize two operations, called schema integration and 
schema join, using commutative diagrams. However, they rely on the notion 
of a “matching part” that is defined only intuitively, which makes it hard to study their
operations formally and relate them to the operator Merge. 

The playground of state-based semantics is the category Inst. Working 
with Inst, rather than Sch or Sig, enables us to deal with non-atomic ob- 
jects, i.e., models as sets of instances, and to understand the semantics of the 
key model-management operators more deeply. Well-known scenarios help 
us analyze the properties of the operators and may enable us to derive 
their axiomatic characterization in Sch. For example, the operator Merge,
$\langle M, f, g \rangle = \mathrm{Merge}(A, B, h)$, could be characterized in Sch as: $f \circ f^{-1} = 1_A$,
$g \circ g^{-1} = 1_B$, $M = \mathrm{dom}(h)$, $h = f \circ g^{-1}$. The minimality condition of
Definition 4.2.4 cannot be stated in Sch directly. However, it may be possible
to find a set of theorems in Inst involving the operator Merge, such as
Theorem 4.2.11, Corollary 4.2.1, or Conjecture 11.3.1, that provide further re- 
strictions on the operator and can be stated as axioms in Sch. Alternatively, 
the operators Extract, Merge, and Diff could be viewed as new fundamental 
operators, equal peers of the operators Compose and Id. Notice that the mi- 
nimality conditions in our definitions are essential. Without them, we could 
always extract $m$ from $m$, or get arbitrarily expressive models as a result of
Merge. 

Another relevant category-theoretic notion is that of a topos (Bell 1988). 
A topos is a Cartesian closed category in which for each object there exists an
object of its “subobjects”, which can be regarded as instances. In a topos, 
as in set theory, every object and every arrow can be considered as the ex- 
tension {x | P(x)} of some predicate P. This view corresponds closely to our 
treatment of models and mappings as predicate variables. The category Inst 
seems to be closely related to toposes. 



10.7 Metadata Repositories 

Over the years, a number of research prototypes and commercial products 
have been developed to support metadata management, including Rational 
Rose tools 1 and Microsoft Repository 2 (Bernstein et al. 1999). Such tools 
do an excellent job in providing a design environment or persistent storage 
for metadata artifacts. However, the existing tools do not go far enough to 
support the developers of metadata applications, which may be one of the
reasons for their limited adoption and commercial success.

Do and Rahm (2000) reviewed several commercial metadata repository 
systems that are specifically targeted at metadata management for data wa- 
rehousing. Typically, repository systems use a relational database for storing 
metadata. The metadata can be accessed and manipulated using SQL, assu- 
ming that the database schema of the repository is known. Some tools provide 
query templates to speed up the construction of frequently used queries, such 
as data lineage and impact analysis queries. 

A SQL or SQL-like query interface to metadata artifacts offers substantial 
help to the developers of metadata applications. However, such an approach still
has significant limitations: 

— A thorough understanding of the relational representation chosen for particular
metadata artifacts is required in order to write the queries. In other
words, the developers need to be proficient in the meta-meta model and
the meta-models of the artifacts that they deploy.

— The queries are often quite complex, since they operate on the individual
model elements.

— It is hard to migrate the applications between different repository systems
because the applications are tied to particular meta-models and meta-meta
models.

1 www.rational.com

2 Currently shipped with Microsoft SQL Server 2000 under the name Meta Data
Services
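
The limitations listed above can be illustrated with a small sketch. The repository layout used here, a single table named element with the columns id, model, kind, name, and parent, is purely hypothetical and does not reproduce the schema of any actual repository product; the example uses Python's sqlite3 module with an in-memory database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical repository layout: one row per model element.
conn.executescript("""
    CREATE TABLE element (
        id     INTEGER PRIMARY KEY,
        model  TEXT,
        kind   TEXT,
        name   TEXT,
        parent INTEGER REFERENCES element(id)
    );
    INSERT INTO element VALUES
        (1, 'crm', 'table',  'Customer', NULL),
        (2, 'crm', 'column', 'Name',     1),
        (3, 'crm', 'column', 'Phone',    1),
        (4, 'erp', 'table',  'Client',   NULL),
        (5, 'erp', 'column', 'Name',     4);
""")

# Element-level query: column names shared by two tables of two models.
# The author must know how tables and columns are encoded as rows.
rows = conn.execute("""
    SELECT c1.name
    FROM element t1
    JOIN element c1 ON c1.parent = t1.id
    JOIN element t2 ON t2.kind = 'table'
    JOIN element c2 ON c2.parent = t2.id AND c2.name = c1.name
    WHERE t1.kind = 'table' AND t1.model = 'crm' AND t1.name = 'Customer'
      AND t2.model = 'erp' AND t2.name = 'Client'
      AND c1.kind = 'column' AND c2.kind = 'column'
    ORDER BY c1.name
""").fetchall()

shared_columns = [name for (name,) in rows]
conn.close()
```

Even this simple question requires a four-way self-join and knowledge of how the assumed layout encodes tables and columns, and the query would have to be rewritten for a repository that stores the same metadata differently.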

In contrast, the model-management operators offer a much higher-level, 
generic interface for application developers. Still, as we pointed out in 
Sect. 3.3, a SQL-like querying capability has proved instrumental for imple- 
menting the model-management operators in Rondo. Such capability may be 
exposed to the developers of metadata applications to complement the model- 
management operators in dealing with special-purpose metadata transfor- 
mations such as schema normalization. Furthermore, low-level, performance- 
sensitive metadata management functionality, such as versioning at the level 
of individual model elements (Bernstein et al. 1999), will likely continue to
rely on SQL-like APIs.

The model-management approach has great potential to boost the capabilities
of today’s repository systems and simplify their use. Moreover, even
the vendors of repository systems themselves might benefit from the availability
of model-management operators. In fact, many repository systems
include pre-packaged metadata applications, such as configuration manage- 
ment or impact analysis applications, whose development and maintenance 
using low-level APIs is costly. 

The books (Marco 2000; Tannenbaum 2001) offer a survey of commercial 
repository systems and discuss implementation options for several metadata 
management tasks. 



10.8 Metadata-Intensive Applications 

As we illustrated in Sect. 1.1, metadata problems arise in a variety of applica- 
tions and scenarios. In this section we discuss in more detail the work related 
to two such scenarios, integration of heterogeneous information processing 
services and change propagation. 

10.8.1 Declarative Mediation 

In the late 90’s, a multitude of information processing services started to 
become available online and supersede the static Web content. Such services 
accept data, process it, and return results. A variety of services went on air, 
such as search engines, digital libraries, flight/hotel/car reservation systems, 
e-shops, tax filing services, or calendar managers. As more such components 
were deployed, the diversity of program-level interfaces emerged as an
important stumbling block providing a challenging research opportunity. 
In the work (Melnik et al. 2000; Melnik 2000; Melnik and Decker 2000;
Decker et al. 2000a; Decker et al. 2000b) we focused on interoperation of
heterogeneous information processing services. Management of metadata turned
out to be a heavy component of this work and motivated in part the subject
of this dissertation. In fact, the systems and frameworks that we developed
provided us with a hands-on case study for metadata management. The code
developed in these projects served as a seed and inspiration for the programming
platform that has been prototyped as part of this dissertation. The remainder 
of this section gives a brief overview of this work and summarizes the key 
conclusions. 

The mediation architecture has been used in numerous information inte- 
gration projects (Wiederhold 1992). It introduces two key elements, wrappers 
and mediators. The wrappers hide a significant portion of the heterogeneity 
of services, whereas the mediators perform a dynamic brokering function in 
a relatively homogeneous environment created by the wrappers. A mediator 
typically receives a request (e.g., a query), submits a translated version of the 
request to several services, collects and merges the responses, and presents 
them to the user. Mediators that were developed in previous research efforts 
had some important shortcomings, which are in fact still present in most of 
today’s systems: 

— Mediators are often hard to extend beyond the initial set of services they 
were designed for. 

— It is difficult to incorporate into a mediator components that were develo- 
ped elsewhere. For example, once a particular query translation algorithm 
has been implemented in a mediator, it is very hard to replace it by some 
other query translation package. 

— Most often mediators do not tackle protocol differences. For instance, many 
mediators assume that all their targets communicate via HTTP. 

— Usually it is not easy to extend a mediator to non-search tasks. For exam- 
ple, if a mediator is designed to query multiple search engines, it is hard to 
make it mediate among different payment mechanisms or among different 
document summarization services. 

In (Melnik 2000), we proposed a mediation framework that addressed 
these shortcomings. In (Melnik et al. 2000), we studied an application of this 
framework to the domain of digital libraries. The framework presents a very 
flexible environment where different components, which we call “blades”, can
be combined to address a specific mediation task. One of the components, in 
particular, is responsible for translating protocols. For example, this compo- 
nent may receive a single synchronous message from a user, and in turn issue 
a sequence of asynchronous messages to perform the requested task. 

In our approach, all data conversion and protocol translation logic is em- 
bodied in mediators, whereas wrappers are as simple as possible and thus can 
be developed with moderate effort. Such wrappers, which we call canonical,
capture information contained in the messages of the component by means
of logical descriptions and are not required to do any processing beyond tri- 
vial syntactic transformation of the messages of the wrapped component. A 
logical description of the request makes it possible to abstract out factors 
irrelevant to the purpose of the service, e.g., whether it is implemented as a 
distributed object or a CGI script. The logical descriptions of the messages 
are encoded as directed labeled graphs and can be manipulated using alge- 
braic operations, transformation rules etc. The interface descriptions of the 
canonical wrappers are represented as finite-state automata that accept and 
send messages of different kinds. 
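
The two representations just described can be sketched in a few lines. The example below is only an illustration under our own assumptions: the message kinds, state names, and helper functions are hypothetical and do not reproduce the encoding used in the cited framework. A message is a set of labeled edges, and an interface description is a finite-state automaton over sent and received message kinds.

```python
# A message encoded as a directed labeled graph: a set of
# (source node, edge label, target node) triples. All names are hypothetical.
search_request = {
    ("msg1", "kind", "SearchRequest"),
    ("msg1", "query", "q1"),
    ("q1", "keyword", "model management"),
}


def message_kind(graph, root):
    """Return the target of the 'kind' edge leaving root, or None if absent."""
    for source, label, target in graph:
        if source == root and label == "kind":
            return target
    return None


# Interface description of a wrapper as a finite-state automaton:
# (state, direction, message kind) -> next state.
INTERFACE = {
    ("idle", "receive", "SearchRequest"): "searching",
    ("searching", "send", "ResultPage"): "idle",
    ("searching", "send", "Error"): "idle",
}


def accepts(interface, trace, start="idle"):
    """Check whether a sequence of (direction, kind) steps is allowed."""
    state = start
    for direction, kind in trace:
        key = (state, direction, kind)
        if key not in interface:
            return False
        state = interface[key]
    return True


print(message_kind(search_request, "msg1"))
print(accepts(INTERFACE, [("receive", "SearchRequest"), ("send", "ResultPage")]))
```

Because both structures are plain data, they can be stored, queried, and compared like any other metadata artifact.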

Even though mediators are shielded from the native components by the 
wrappers, they still have to deal with the semantic heterogeneity of informa- 
tion exposed by the wrappers. The major integration problems include data, 
query, and protocol translation. Thus, in general, a number of different forma- 
lisms are required to describe the mediator logic. For example, manipulations 
of the content of the messages can be expressed using Datalog rules, XSLT, or 
other data manipulation languages such as YATL (Cluet et al. 1998). Trans- 
formations of message sequences can be described using finite-state machines, 
Petri nets etc. For query rewriting, powerful approaches have been developed 
in the database literature (Halevy 2001). In (Melnik et al. 2000) we describe 
in detail which formal techniques we picked for implementing a declarative 
mediation system for digital libraries. 

To facilitate mixing and reuse of different formalisms, the declarative lan- 
guages used by our mediators are represented using a meta-meta-model that 
is capable of capturing and linking expressions in different languages. Our 
meta-meta-model is based on directed labeled graphs, similarly to how the 
messages themselves are encoded (in Rondo, we use an extended version 
of this meta-meta-model that supports ordered relationships). To execute 
mediators that deploy different formalisms and languages, we developed a 
comprehensive runtime environment, which makes sure that the appropriate 
interpreters, or blades, are invoked for the declarative languages used in the 
specifications. 

Lessons Learned. One of the key lessons that we learned from our work on 
declarative mediation is that developing and maintaining an environment 
which uses a variety of complex metadata artifacts is really hard. A first 
step that we took to address the metadata management issues was to store 
mediator specifications and interface descriptions as first-class objects in a 
database system. In this way, we had a reliable persistence mechanism which 
also allowed us to query mediator and wrapper specifications. However, the 
key challenge turned out to be the implementation of operations on these 
complex structures. Examples of such operations are tracing the changes 
of evolving interface descriptions and updating the mediators accordingly, 
generating wrapper skeletons from interface specifications, or determining 
algorithmically the compatibility of wrappers and mediators. We realized that 
to handle the complexity of descriptions, support for mediator composition
is required (Melnik 2000). Supporting composition is hard, since multiple 
formalisms may be used throughout the mediator. For example, a complex 
mediator may be specified as a finite-state machine and may invoke Petri nets 
for executing subtasks that require concurrency control. Service composition 
is an open problem that recently has attracted significant attention (Hull 
et al. 2003). 

We also learned that to be able to deal with a variety of metadata arti- 
facts and formal languages, we needed some way of abstracting out the key 
properties of their representation and manipulation. We grew even more con- 
vinced of the importance of a generic approach for metadata management in 
our subsequent work (Melnik and Decker 2000). In (Melnik and Decker 2000), 
we presented a layered approach to information modeling and interoperabi- 
lity on the Web. The key idea of the approach is to automate the translation 
of messages exchanged between heterogeneous components by attaching me- 
tadata to the messages that describes the features of the meta-model utilized 
for representing the message content. The message metadata is split into 
“layers”, each of which describes a certain set of modeling features, such as 
object identity, ordered relationships, aggregation, etc. Our work on decla- 
rative mediation and layering indicated that there was a great potential in 
approaching metadata management in a generic fashion. 



10.8.2 Change Propagation 

Change propagation is a pervasive problem that has attracted substantial 
attention in the database research literature. For example, Roddick (1992) 
lists an impressive annotated bibliography of work done in the 1970-80’s. 
Roddick et al. (2000) classify manifestations of change by subject, causes,
effects, temporal and spatial issues, etc. The economic factors of software
evolution and maintenance have been considered in (Wiederhold 2003). 

Various aspects of the change propagation problem have been studied in 
the database literature including view adaptation, view synchronization, view 
maintenance, and, most recently, mapping adaptation. The general theme is 
to examine the effect of changes of some source metadata artifacts or data on 
target metadata artifacts or data. The subjects of change, i.e., sources and 
targets, can differ substantially. We distinguish some of them below: 

— Changes to a schema affect an existing schema instance. Banerjee et al. 
(1987) study close to two dozen kinds of changes that can occur in object- 
oriented databases, such as adding/dropping instance variables to a class, 
changing default values of variables, changing the order of superclasses of 
a class, etc. They suggest a set of rules of how instance data should be 
adapted to schema changes, and address the soundness and completeness 
of elementary changes. Essentially, soundness and completeness guarantee 
that all operations produce valid object lattices and any object lattice 
can be obtained using a sequence of operations. The work in (Peters and
Ozsu 1997) provides an axiomatization of schema evolution in a similar 
perspective. In (Lerner 2000), complex changes spanning multiple classes 
are considered. 

— Changes to a source schema affect a view defined on the schema. This 
change propagation problem is referred to as view synchronization and has 
been studied, e.g., in (Lee et al. 2002). 

— Changes to the instance of a source schema affect the target instance. This 
issue was studied in the context of maintenance of materialized views (see 
e.g. (Mumick et al. 1997)). Claypool et al. (1998) consider schema evolution
using primitives expressed in OQL, while Claypool and Rundensteiner
(2003) define a cross-algebra that helps propagate changes on instances
between two different data models, such as XML and relational.

— Changes to the mapping (and the target schema) affect the instance of the 
target schema. This kind of change has also been termed view adaptation
(Gupta et al. 1995): once the view definition changes, the materialized
view needs to be updated, ideally, without recomputing the view from 
the base relations. This problem is closely related to answering queries 
using views (Halevy 2001), where the query corresponds to the new view 
definition, and the view is the original materialized view. In general, the 
new materialized view can be computed using a portion of the data from the
old view and another portion recomputed from old relations.

— Changes to instance data yield changes to the schema. This way of propa- 
gating changes takes a reverse path compared to the first item and is im- 
portant for cases when instances are decoupled from schemas (Parsons and 
Wand 2000), description logics (Borgida 1995), or for incremental main- 
tenance of schemas extracted from semi-structured data (Nestorov et al. 
1998). 

— Changes to the source or target schema affect the mapping between the 
source and the target (Velegrakis et al. 2003) and the target schema (Bern- 
stein 2003). 

When schemas are the source of changes, the way that changes are speci- 
fied is another dimension in which approaches to change propagation differ. A 
typical assumption is that the changes are characterized using a finite set of 
primitive operations on schemas, often accompanied by a corresponding set 
of primitive instance transformations. This approach was used, e.g., in (Ba- 
nerjee et al. 1987; Peters and Ozsu 1997; Lerner 2000; Velegrakis et al. 2003). 
The advantage of using a fixed set of elementary schema changes is that we 
can specify precisely how to handle each individual change. The disadvantage 
is that the way in which the schema can evolve is restricted. 
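
As an illustration of this style of specification, the sketch below pairs three primitive schema changes with their instance transformations. It is a simplified example under our own assumptions (a schema is a list of attribute names, an instance a list of dictionaries) and does not reproduce the operation sets of the cited papers.

```python
def add_attribute(schema, instances, name, default=None):
    """Primitive change: add an attribute; existing rows receive the default."""
    new_schema = schema + [name]
    new_instances = [{**row, name: default} for row in instances]
    return new_schema, new_instances


def drop_attribute(schema, instances, name):
    """Primitive change: drop an attribute; its values are discarded."""
    new_schema = [a for a in schema if a != name]
    new_instances = [
        {k: v for k, v in row.items() if k != name} for row in instances
    ]
    return new_schema, new_instances


def rename_attribute(schema, instances, old, new):
    """Primitive change: rename an attribute; values are carried over."""
    new_schema = [new if a == old else a for a in schema]
    new_instances = [
        {(new if k == old else k): v for k, v in row.items()}
        for row in instances
    ]
    return new_schema, new_instances


# Hypothetical schema and data.
schema = ["id", "name"]
instances = [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Bob"}]

schema2, instances2 = add_attribute(schema, instances, "email", default="")
schema3, instances3 = rename_attribute(schema2, instances2, "name", "full_name")
print(schema3, instances3)
```

Each primitive is easy to handle precisely, but a change that is not expressible as a sequence of such primitives falls outside the scheme.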

Alternatively, the changes can be described using a mapping between 
the old and new source schema, the approach presented in (Bernstein 2003), 
which we also followed in our work on change propagation (see Sect. 2.1 
and Chap. 5). A mapping is capable of accommodating virtually all kinds of 
conceivable changes such as schema normalization, changing attribute type,
applying a user defined function, transposing the schema, etc. In fact, in 
the general case such a mapping need not even be functional. To our kno- 
wledge, propagation of changes described by an arbitrary mapping has not 
been examined previously. 

The mapping that describes the changes can be obtained in various ways. 
One way is to allow schemas to evolve and then find the changes that took 
place by comparing the modified schema to the original version using schema 
matching. Another way is to compose a number of elementary changes and 
thereby leverage the specification of changes developed and studied in pre- 
vious approaches. A history of changes can be produced, e.g., by schema 
manipulation tools. 
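
The second option, composing a history of elementary changes into a single mapping, can be sketched as follows. We assume, for illustration only, that each step is recorded as a set of (old element, new element) correspondences; the element names are hypothetical, and the `compose` helper is plain relational composition rather than the prototype's own operator.

```python
def compose(m12, m23):
    """Relational composition: (a, c) whenever (a, b) in m12 and (b, c) in m23."""
    return {(a, c) for (a, b) in m12 for (b2, c) in m23 if b == b2}


# Step 1 (hypothetical): attribute "name" is split into two attributes.
v1_v2 = {
    ("name", "first_name"),
    ("name", "last_name"),
    ("zip", "zip"),
}

# Step 2 (hypothetical): attributes are renamed.
v2_v3 = {
    ("first_name", "given_name"),
    ("last_name", "family_name"),
    ("zip", "postal_code"),
}

# The composed mapping relates the original schema directly to the latest one.
v1_v3 = compose(v1_v2, v2_v3)
print(sorted(v1_v3))
```

Note that the composed mapping is not functional, since "name" corresponds to two target elements.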

In Sect. 5.4, we discussed schema evolution as a special case of change 
propagation. A formal specification of schema evolution offers precise gui- 
delines for computing the effects of changes, in contrast to heuristic rules 
that are deployed in some of the approaches in the literature. For example, 
Velegrakis et al. (2003) consider the problem of adapting the mapping upon 
a schema change and propose to “make the minimum changes necessary to
achieve a mapping that is consistent with the new schema”. However, the lack 
of formalization of what constitutes a minimal change makes the adaptation 
techniques that they propose debatable: specifically, we argue that neither 
removal of schema constraints nor addition of new schema elements should 
impact the existing mappings. By modifying the mappings in such cases, new 
ways of relating data instances are “invented” whereas the old mapping still 
holds - a minimal change should arguably leave the mapping intact. 

An important aspect of change propagation is efficiency (see e.g. (Banerjee 
et al. 1987; Gupta et al. 1995; Mumick et al. 1997)). In approaches that focus 
on instance data, a primary concern has been batching and delaying the 
updates, and minimizing their impact on the DBMS performance. Batching 
updates can be expressed as a composition of several individual updates. For 
applications that access old versions of data that has been reorganized in the 
course of schema evolution, reverse transformations need to be computed. 



10.9 Other Related Work 

Data Translation. Data translation was one of the first hot topics in the database
research field (Bernstein 1999). In fact, before ACM SIGMOD gained its
current name in 1975 (“Management Of Data”), it was called SIGFIDET,
for “File DEscription and Translation”. Data translation is the problem of
transforming data when moving it from one application to another. The EX- 
PRESS project at IBM Research was one of the foremost data translation 
projects of its day (Shu et al. 1977). 

A number of rigorous formal techniques for data translation have been 
developed. For example, Kalinichenko (1990) presents a formal definition of 
data models and manipulates them as formal objects in the process of
development of mappings between data models. The author also explores a
methodology for synthesizing a “unifying generalized data model” for a
given set of data models. Markowitz and Shoshani (1992) discuss a formal
technique for translating Entity-Relationship structures into a relational
representation. Rosenthal and Reiner (1994) examine equipotence-preserving
transformations of schemas and give formal proofs of correctness of schema
rearrangements. They argue for a combination of heuristics and rigorous
transformations.

Generic approaches to data translation across different schema languages 
have been explored, e.g., in (Atzeni and Torlone 1996; Cluet et al. 1998). 
The techniques presented there could be used for implementing a generic 
operator for generating one model from another. Such an operator, called 
ModelGen, was suggested in (Bernstein 2003). In our prototype, we are using 
a less general approach, in which each converter is implemented as a custom, 
non-generic operator. In Sect. 5.5, we discussed the impact of data translation 
on model-management scripts and their state-based semantics. 

More recently, the problem of data translation has undergone a resurgence 
of interest in the context of data warehousing (ETL tools) and integration of 
heterogeneous Web sources, in particular, translating relational data to XML 
(Shanmugasundaram et al. 2001b; Fan et al. 2003). 

Mapping Tables. The recently proposed mapping language of (Kementsiet- 
sidis et al. 2003) generalizes the notion of value transformations. It allows 
specifying the dependencies between the entities of two schemas using a so- 
called mapping table, an extensionally defined table of value correspondences. 
Value transformations between entities in schemas can be quite intricate. A 
trivial example is concatenation of first name and last name to obtain full 
name. A more involved example is the transformation of a planar circle re- 
presented by three points into a circle represented by a center and a radius. 
Sometimes, however, value correspondences cannot be represented using a 
formula and need to be defined extensionally. For example, the correspon- 
dences between gene and protein identifiers in biochemical databases, or the 
mapping from Zip and City to State are specified as lists of value tuples. In 
(Kementsietsidis et al. 2003), the authors use a state-based approach to de- 
fine the semantics of mapping tables. They also consider operations on them, 
such as AND ($\land$), OR ($\lor$), and negation ($\neg$). Operation $\land$ generalizes to $\cap$
in our formalization, $\lor$ corresponds to $\cup$, and negation can be expressed as
$\neg m_1\_m_2 = m_1 \times m_2 - m_1\_m_2$.
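
A small sketch may help to see these operations at work. We assume here, as one plausible reading and not as the formalization of the cited paper, that a mapping table is a finite set of value pairs, that AND and OR behave like set intersection and union, and that negation is the complement with respect to the cross product of the two value domains. All values are made up.

```python
from itertools import product

# Hypothetical value domains of the two sides of the mapping.
zips = {"94305", "98052"}
states = {"CA", "WA"}

# Two extensionally defined mapping tables (sets of value pairs).
table_a = {("94305", "CA"), ("98052", "WA")}
table_b = {("94305", "CA"), ("98052", "CA")}


def negate(table, left_values, right_values):
    """Complement of a mapping table relative to the cross product."""
    return set(product(left_values, right_values)) - table


both = table_a & table_b      # AND read as intersection
either = table_a | table_b    # OR read as union
not_a = negate(table_a, zips, states)

print(sorted(both), sorted(either), sorted(not_a))
```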

Confluence. Kementsietsidis et al. (2003) also considered an operation that 
corresponds to our Confluence operator (personal communication). They note 
that sometimes a small part of two mapping tables is inconsistent. In this 
case, ANDing the information of two tables yields a contradiction and renders 
the whole result unusable. An equivalent of the Confluence operator can be 
used to AND a mutually consistent part of two mapping tables and OR 
the non-overlapping parts. The advantage of defining confluence in a generic
fashion is that it can also be used to combine a mapping that consists in part 
of say an SQL view and a mapping table. Another interesting observation is 
that a mapping table is actually an instance of a relational schema. That is, 
we have an example of a mapping that is itself an instance of some model. 

As noted above, the Confluence operator can be used to deal with inconsistent
mappings and thus may be useful for the integration of mutually inconsistent
data sources (Lenzerini 2002). By Definition 4.2.6, combination of queries 
using the Confluence operator ensures that only mutually consistent answers 
appear in the result. That is, information loss is possible when the input 
queries or views are inconsistent with each other (Agarwal et al. 1995). 
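
The behavior described above can be approximated on mapping tables as follows. This is a rough sketch under our own simplifying assumption, namely that two tables overlap on a left-hand value if both contain a pair for it; it is not a restatement of Definition 4.2.6, and the gene and protein identifiers are invented.

```python
def confluence(t1, t2):
    """AND the overlapping part of two mapping tables, OR the rest.

    For left-hand values covered by both tables only pairs present in both
    are kept; pairs for values covered by a single table are kept as they are.
    """
    keys1 = {a for a, _ in t1}
    keys2 = {a for a, _ in t2}
    shared = keys1 & keys2
    agreed = {(a, b) for (a, b) in t1 & t2 if a in shared}
    only1 = {(a, b) for (a, b) in t1 if a not in shared}
    only2 = {(a, b) for (a, b) in t2 if a not in shared}
    return agreed | only1 | only2


# Hypothetical gene-to-protein tables that disagree on "g2".
t1 = {("g1", "p1"), ("g2", "p2"), ("g4", "p4")}
t2 = {("g2", "p9"), ("g3", "p3"), ("g4", "p4")}

print(sorted(confluence(t1, t2)))
print(sorted(t1 & t2))
```

In this example the conflicting entries for "g2" are dropped, which is the information loss mentioned above, whereas a plain AND of the two tables would also discard the entries for "g1" and "g3".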

In the database literature, the notion of confluence has been used in the
context of active rules and triggers (see e.g., (Aiken et al. 1992)). There,
confluence is a property of a set of active rules or triggers that holds if the effect of
rule execution is independent of the order of their execution. We use the term
confluence differently, to denote an operator on mappings. See Sect. 10.1 for 
further discussion. 

Z, B-Method, AMN. Z (pronounced: “zed”), B-Method and AMN (Abstract 
Machine Notation) are languages designed for specification and verification 
of computer systems. These languages have formal semantics based on
set theory and predicate calculus. The schemas in Z describe the possible
states of a system, its operators, and relationships between its parts (Davies 
and Woodcock 1996). Basically, a schema defines a number of n-ary relations 
with pre-conditions and post-conditions. The language Z is an ISO standard 
(ISO 2002). 

The so-called schema calculus used in Z shows an interesting parallel to 
model management. The schema calculus introduces several operators, such 
as And, Or, Iff, Compose, Implies, Pipe, and Project, for building bigger sche- 
mas out of smaller ones. The operators represent set operations with special 
signature translations. For example, And and Or are similar to ANDing and 
ORing of logical formulae. 

The schema calculus in Z is bound to a specific schema language and its 
operators are more low-level than in generic model management. However, it 
would be interesting to see whether the operators in Z can be generalized for 
other kinds of models. 
“The significant problems we face cannot be solved at the same
level of thinking we were at when we created them.” 

- Albert Einstein (1879-1955) 



11.1 Summary of Contributions 

Many problems facing data management and other areas of computer-aided 
engineering involve the manipulation of models. Yet applications that mani- 
pulate models are complicated and hard to build. The goal of generic model 
management is to reduce the cost of developing such applications by raising 
the level of abstraction of model manipulation operations. 

This dissertation presents an initial study of the concepts and algorithms 
for generic model management. To demonstrate that model management operators
are implementable and useful, we developed a prototype of a programming
platform, called Rondo, in which high-level algebraic operators
are deployed for manipulating models and mappings. The prototype helped 
us experiment with various representations of models, alternative definitions 
of operators, and different algorithms used for implementing the operators. 
Using Rondo, we developed scripts for several practically relevant scenarios, 
such as change propagation and reintegration. We have shown that one can 
solve practical problems using the model management operators, and that 
these solutions require a relatively small amount of code. 

To implement one of the most challenging model-management operators, 
the operator Match, we devised a general-purpose matching algorithm cal- 
led Similarity Flooding. The algorithm can be applied for matching various 
kinds of models in metadata management scenarios as well as for other data 
structures and applications. We examined the computational properties of 
the algorithm and evaluated its quality using a novel accuracy metric and a 
user study that we conducted. 
We presented a detailed survey of the related work that helped us factor
out the common aspects of metadata applications and specify the structural 
and state-based semantics of the operators. Specifically, we considered data 
integration, schema matching, mapping composition, view selection, and view 
complement problems. The state-based semantics describes the effect of the 
operators on instances of models. It provides guidelines for implementing the 
operators for complex schema and mapping languages and is independent of 
a particular meta-meta-model. Both structural and state-based semantics are
critical for specifying the effects of model-management scripts.

Our implementation experience, backed by the in-depth investigation of 
the individual operations in the research literature, suggests that the question 
raised in the panel discussion (Bernstein et al. 2000a) is likely to have a 
positive answer, i.e., generic metadata management is in fact feasible. Even 
if we cannot handle subtle and complex cases, if we can solve a large class 
of non-trivial problems then we are offering a useful programming platform. 
Still, resolving this debate to the full extent can be done only by writing 
scripts for a substantial number of real applications, which use practically 
relevant schema and mapping languages, and demonstrating that they work. 

In this first dissertation on generic model management we only scratched 
the surface of this emerging field of research. In Sect. 11.2, we attempt to 
give an assessment of the current state of the field and provide a roadmap 
for developing the next generation of model-management systems. Our work 
uncovered many hard technical challenges and exciting new research oppor- 
tunities, which are reviewed in Sect. 11.3. A salient non-technical challenge 
is acceptance by the developer community. As with each new programming 
paradigm, the willingness of engineers to learn a new way of approaching old 
problems is a critical ingredient for success of generic model management. 



11.2 Concluding Discussion 

In this section, we examine the achieved state of the art in model management 
and the gaps that need to be filled in order to build the next generation of 
more powerful and versatile model-management systems. 

At the core of the model-management approach is a set of generic operators
on models and mappings that simplify application programming. To
what extent can the techniques developed in the literature and in this dis- 
sertation be called generic? How far can we push the model-management 
approach while claiming genericity? These questions are critical for laying 
out a roadmap for future work and understanding how far we are from achie- 
ving our goals. 

Generic model management techniques address the following three as- 
pects: 
— Generic applicability: The operators can be applied to various kinds of
models and mappings. 

— Generic use: The operators are useful for a broad range of model- 
management tasks. 

— Generic implementation: A single implementation of the operators is ap- 
plicable for various kinds of models and mappings. 

Generic Applicability. This aspect refers to the ability of engineers to write 
scripts without worrying about the nature of metadata artifacts they work 
with. To ensure generic applicability, the operators need to provide guaran- 
tees to the engineers that hold for all relevant kinds of models and mappings, 
including database schemas, workflow definitions, interface specifications, etc. 
Obviously, the less is known about the metadata artifacts under considera- 
tion, the fewer guarantees can be provided to the engineers with respect to 
the properties of the operators. In other words, the semantics of the operators 
can be stated only in very general terms or otherwise sacrifice genericity. 

Three distinct ways of achieving generic applicability of model- 
management operators have been suggested. One way is to consider models 
and mappings as syntactic objects represented in a common meta-meta- 
model, for example, as graphs. This approach has been pursued in almost all 
prior work on generic model management, including the prototype developed 
as part of this thesis (Bernstein et al. 2000b; Bernstein and Rahm 2000; 
Bernstein 2003; Melnik et al. 2003b; Pottinger and Bernstein 2003). In 
essence, the operators are specified by means of graph transformations. As 
long as the graph transformations do not exploit any knowledge of what 
the graphs actually represent, the operators can be considered truly generic. 
Unfortunately, there are very few useful operations that can be defined in 
such an agnostic fashion. Largely, they are limited to Subgraph, Copy, and 
the set operations on graphs. In our experience, specification of most if not 
all other operations needs to be adapted to the individual meta-models for 
the operators to produce meaningful results. For example, most operators in 
Rondo (see Table 2.1 on page 23) are defined assuming a concrete mapping 
language, the morphisms. Analogously, the operators presented in (Bernstein 
2003; Pottinger and Bernstein 2003) exploit the properties of a specific 
mapping language, though a more general one. 

A second way to achieve generic applicability is by using state-based se- 
mantics. In this approach, the properties of the operators are characterized 
in terms of instances of models and mappings that are taken as input and 
produced as output. Under the assumption that models possess well-defined 
sets of instances, all key operators can be characterized in a truly generic 
fashion, as we demonstrated in Chap. 4. Such characterization is applicable 
to very complex kinds of models and mappings that are used in real appli- 
cations, including XML Schemas, XQuery, and SQL. Although state-based 
characterization does not provide a detailed implementation blueprint, it is 
sufficiently specific so that the effect of the operators can be worked out for 
concrete languages. A weakness of the state-based approach is that it says
nothing about the syntax of models and mappings. Yet, the syntax of mo- 
dels (e.g., their structure and naming of model elements) is important for 
applications. Moreover, for certain kinds of models, such as make scripts and 
other program-like models, specifying the sets of instances formally can be 
non-trivial. 

A third way of addressing generic applicability is an axiomatic one, e.g., 
using a category-theoretic approach (compare Sect. 10.6.3). The idea of the 
approach is to define the operators using axioms that are expressed in terms 
of the operators to be defined. Commutativity of Compose or associativity 
of Merge are examples of such axioms. This approach seems to be the most 
challenging, both in terms of determining a useful set of axioms and imple- 
menting the operators in such a way that the axioms hold when the operators 
are applied to concrete languages. 

From the current perspective, it seems that our best bet for achieving 
generic applicability of operators is to combine state-based semantics with a 
syntax-oriented specification based on a common meta-meta-model. Such a 
combined specification of operator semantics may provide enough guarantees 
to the engineers to deploy the operators for manipulating various kinds of 
models and mappings without having a detailed knowledge of the operator 
implementation. Working out the details of such a combined specification is 
one of the gaps to be filled. It is possible that its syntax-oriented part turns
out to be relatively simple: for example, one condition could be that the element
names of the output models have to be drawn from the corresponding element 
names of the input models. 

Of course, it is conceivable that we face hard limits to the generic appli- 
cability of operators. Most model-management scenarios examined so far in 
the literature focus on schema-like models, e.g., database schemas, ER/UML 
diagrams, or ontologies. To live up to the claim of generic applicability, the
model-management operators should also be applicable to workflow definitions,
interface specifications, computational models, and other artifacts. Neverthe-
less, manipulation of schema-like models makes up the lion’s share of today’s
metadata management applications. Even if we limit the scope of a model- 
management system to schema-like models, the ability of manipulating such 
artifacts in a generic fashion could yield a dramatic increase in programmer 
productivity. 

Generic Use. The usefulness of the model-management operators for imple- 
menting real applications is probably the most challenging claim in model- 
management research. There seem to be two complementary ways of justify- 
ing this claim: an empirical and a theoretical one. 

The previous work on model management and this dissertation followed 
the empirical path. That work started with a solid intuitive understanding 
of the operator semantics and substantiated the generic use of the operators 
by examining detailed walkthroughs of various model-management problems 
(Bernstein and Rahm 2000; Bernstein 2003). The prototype Rondo developed
as part of this thesis helped prove that such abstract programs are indeed
executable. To address the requirements of industrial-strength applications,
future model-management systems need to support complex mapping lan- 
guages such as SQL, XQuery, or transformation languages used in software 
engineering applications. The ultimate empirical proof of generic use could be 
provided by turning a model-management system into a successful product. 

A complementary way of justifying generic use is a theoretical one. The 
idea is to show that the proposed set of operators is complete with respect 
to the chosen operator semantics. For example, if the operators are speci- 
fied in terms of graph transformations, completeness would ensure that all 
meaningful graph transformations can be realized using a combination of the 
operators. If state-based semantics is assumed, one could attempt to verify 
whether the operators can be used to define output models and mappings that 
describe any chosen set of instances and relations on instances. We consider 
the completeness question in more detail in Sect. 11.3.3. Currently, there is no 
good understanding of what completeness of model management operators 
could mean. This is an important gap to be filled.

Generic Implementation. Using a single implementation for various kinds 
of models and mappings has been considered a primary objective in most 
existing literature on generic model management (Bernstein et al. 2000b; 
Bernstein and Rahm 2000; Bernstein 2003; Melnik et al. 2003b; Pottinger and 
Bernstein 2003). Generic implementation helps extend a model-management 
system quickly for new kinds of models and mappings. For example, Extract 
and Merge are implemented in Rondo using a single algorithm for each of the 
operators, and a simple callback function to encapsulate meta-model specific 
behavior. Match has a truly generic implementation that does not exploit any 
properties of the underlying meta-models. 
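
The callback idea can be illustrated with a small Python sketch. It is a hypothetical illustration only, not Rondo's code and not the Extract algorithm of this dissertation: the graph encoding, the function `extract_elements`, and the edge labels `table` and `references` are all assumptions made for the example. A single traversal is shared by all meta-models, and the callback alone decides which edges force further elements into the result.

```python
def extract_elements(edges, selected, must_keep):
    """Generic traversal: keep the selected elements plus every element
    that the meta-model-specific callback declares to be required.

    edges:     dict mapping an element to a list of (label, target) pairs
    selected:  elements referenced by the input mapping
    must_keep: callback(label) -> bool, the only meta-model-specific part
    """
    kept = set(selected)
    stack = list(selected)
    while stack:
        node = stack.pop()
        for label, target in edges.get(node, []):
            if target not in kept and must_keep(label):
                kept.add(target)
                stack.append(target)
    return kept

# Hypothetical relational callbacks: a column always needs its table;
# the second variant also follows foreign-key references.
def relational_keep(label):
    return label == "table"

def relational_keep_with_fk(label):
    return label in ("table", "references")

edges = {
    "Employee.EmpNo": [("table", "Employee")],
    "Employee.DeptNo": [("table", "Employee"),
                        ("references", "Department.DeptNo")],
    "Department.DeptNo": [("table", "Department")],
}

plain = extract_elements(edges, {"Employee.DeptNo"}, relational_keep)
with_fk = extract_elements(edges, {"Employee.DeptNo"}, relational_keep_with_fk)
```

Supporting a new meta-model in this style would amount to supplying another callback while leaving the traversal untouched.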

In Rondo, a largely generic implementation was possible due to the simpli-
city of morphisms, the utilized mapping language. For more complex mapping
languages, generic implementation is unlikely. For example, it is hard to see
how the Compose algorithm of Madhavan and Halevy (2003) or the Merge 
algorithm of Casanova and Vidal (1983) can be embedded into generic ope- 
rators without actually implementing the algorithms. Moreover, these and 
many other algorithms are specialized to concrete schema and mapping lan- 
guages. It seems unlikely that the algorithm of Madhavan and Halevy (2003) 
can be used with few changes to compose mapping tables (Kementsietsidis
et al. 2003) or expressions in other mapping languages. 

Although generic implementation is a desirable feature, it does not seem 
critical for the success of generic model management. The greatest benefit 
of the model-management approach is expected from using the operators 
for effective application development. Ideally, the developers of metadata 
applications should not be concerned with operator implementation, as long 
as each implementation satisfies the desired operator semantics. 

Even if generic implementation cannot be achieved, it is still possible
and desirable to utilize a generic representation of models and mappings 
to simplify the implementation of model-management operators. Generic re- 
presentation amounts to rendering all features of individual meta-models in 
a common “data structure”, the meta-meta-model. Earlier in this section 
we discussed the use of a common meta-meta-model for specifying the se- 
mantics of the operators in a generic fashion. While we think that using a 
common meta-meta-model alone for specifying semantics is problematic, it 
may certainly facilitate the implementation. Low-level transformations of me- 
tadata artifacts that are necessary to support operator execution can be car- 
ried out using a SQL-like declarative language that operates on the common 
meta-meta-model. This approach worked very well in Rondo. Moreover, many 
commercial metadata repository systems offer a SQL interface to metadata 
artifacts stored in a relational representation (compare Sect. 10.7). Hence, 
future model-management systems may leverage the low-level capabilities of 
existing metadata repositories. Finding a convenient generic representation 
for complex mappings is another gap to be bridged. 

A Research Agenda for Model Management. To summarize the above discus- 
sion, we outline a high-level research agenda for developing the next genera- 
tion of model-management systems. We believe that the following research 
directions are among the most promising and challenging ones: 

1. Developing a formal semantics for the operators that combines the state- 
based and structural approach while preserving generic applicability of 
operators. 

2. Developing practical materialization algorithms, i.e., algorithms that 
compute the results of operators effectively, for model and mapping lan- 
guages used in real applications. Existing algorithms suggested in the 
literature for the individual operations should be exploited to implement 
and optimize the execution of complex scripts. 

3. Finding appropriate architectures and techniques for coupling model- 
management systems with applications, tools, and conventional program- 
ming languages. The capabilities of existing metadata repositories should 
be exploited for implementing the operators and algorithms. 

4. Developing powerful user interfaces for building model-management so- 
lutions and supporting user feedback during script execution. Ultimately, 
we envision a tool for building model-management solutions graphically 
using Venn-like diagrams like the ones that we used throughout the dis- 
sertation. In this way, the engineer can simply “draw” a script using a 
graphical development environment and materialize the desired models 
and mappings using a single click. 

Bringing the model-management capability to novel and promising do- 
mains, such as design and management of business processes or network ma- 
nagement, may have a great impact on the way applications are developed 
and maintained today and in the future. 

11.3 Open Technical Challenges

The work presented in this dissertation raised many hard technical issues. In 
this section, we review some of them. We believe that resolving these issues 
is instrumental for advancing the state of the art in model management. 

11.3.1 Decidability and Complexity 

The state-based operator definitions are decoupled from any concrete schema 
or mapping languages. However, to make the scripts executable we have to 
consider the operators in the context of specific languages. For example, Mad- 
havan and Halevy (2003) study decidability and complexity of a single opera- 
tion, composition. They consider a GLAV mapping language which consists 
of expressions of the form $Q_a \subseteq Q_b$, where $Q_a$ and $Q_b$ are conjunctive
queries. A similar in-depth investigation may be necessary to obtain decida- 
bility and complexity results for each operator and each concrete language of 
interest. 

It might be possible to state certain general conditions under which the 
operators are guaranteed to be computable, but it is unlikely. Even very 
simple conditions expressed using a state-based characterization are known 
to be undecidable for particular schema and mapping languages. Examples 
are the query containment problem for Datalog programs (Ullman 1997), or 
the question whether there exists a bijection between $m_1$ and $m_2$ for the
SIG schema language (Miller et al. 1994). However, for simpler languages 
such questions may become decidable. For instance, query containment for
conjunctive queries is decidable and NP-complete.
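
For conjunctive queries, containment can be decided by searching for a containment mapping, a classical result that is not specific to this dissertation. The Python sketch below is a brute-force illustration under simplifying assumptions: queries contain only variables (no constants), and a query is encoded as a pair of a head tuple and a list of body atoms. The encoding and the function name are hypothetical.

```python
from itertools import product

def contained_in(q1, q2):
    """Return True if q1 is contained in q2, i.e. if the variables of q2
    can be mapped to terms of q1 so that the head of q2 becomes the head
    of q1 and every body atom of q2 becomes a body atom of q1."""
    head1, body1 = q1
    head2, body2 = q2
    if len(head1) != len(head2):
        return False
    vars2 = sorted({v for _, args in body2 for v in args} | set(head2))
    terms1 = sorted({v for _, args in body1 for v in args} | set(head1))
    atoms1 = set(body1)
    # Exhaustive search over all candidate mappings (exponential in the
    # number of variables of q2).
    for image in product(terms1, repeat=len(vars2)):
        h = dict(zip(vars2, image))
        if tuple(h[v] for v in head2) != tuple(head1):
            continue
        if all((pred, tuple(h[v] for v in args)) in atoms1
               for pred, args in body2):
            return True
    return False

# q_path2(x) :- E(x, y), E(y, z)      q_edge(x) :- E(x, y)
q_path2 = (("x",), [("E", ("x", "y")), ("E", ("y", "z"))])
q_edge = (("x",), [("E", ("x", "y"))])

path_in_edge = contained_in(q_path2, q_edge)   # every 2-path start has an edge
edge_in_path = contained_in(q_edge, q_path2)   # an edge need not extend to a path
```

The exhaustive search reflects the NP-completeness of the problem; for richer languages such as Datalog no analogous terminating procedure exists.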

One way of implementing the state-based semantics is by developing what 
we call a closed language system, i.e., a set of sufficiently expressive schema
and mapping languages that is closed under all model-management operators. 
That is, the result of each operator can be represented explicitly within the 
language system. It is relatively easy to find very simple languages that form 
a closed language system. The problem seems much harder for more expres- 
sive languages. For example, it would be interesting to investigate whether 
relational schemas with relational algebra (or the WOL language (Davidson et al.
1995a)) used as a constraint and mapping language yield a closed language
system. 

If the results of each operator are computable and finite, then we can 
obtain the exact results for each script. However, even if certain intermediate 
results of scripts are not representable in finite form, it may still be possible 
to compute the final results or their materializations by script rewriting. 

11.3.2 Equivalence and Entailment of Scripts 

Studying equivalence and entailment of scripts provides the foundation for 
script rewriting and optimization. For example, it may be desirable to rewrite 
a script into an equivalent script which uses fewer operators or favors opera-
tors of one kind over another and can therefore be executed more efficiently. 
Script optimization may be critical for practical deployment given that com- 
puting the results of a single operator may be NP-hard (Kementsietsidis et al. 
2003; Madhavan and Halevy 2003) and some models can be very large, such 
as models of executable code. Moreover, since the results of certain operators 
may not be computable or representable in finite form, script rewriting may 
help us translate an infeasible script into a feasible one. 

Testing entailment and equivalence may however be quite hard. For ex- 
ample, we hypothesize that the following conjecture holds (see illustration in 
Fig. 11.1): 




Fig. 11.1. Schematic representation for Conjecture 11.3.1 (Associative Merge)



Conjecture 11.3.1 (Associative Merge). The Merge operator is associative up
to isomorphism. Formally, the following entailment holds:

\[
\begin{aligned}
&(m_{12},\, m_{12}\_m_1,\, m_{12}\_m_2) = \mathrm{Merge}(m_1, m_2, m_1\_m_2); \\
&(m_{23},\, m_{23}\_m_2,\, m_{23}\_m_3) = \mathrm{Merge}(m_2, m_3, m_2\_m_3); \\
&m_{12}\_m_3 = m_{12}\_m_2 \circ m_2\_m_3; \\
&m_{23}\_m_1 = m_{23}\_m_2 \circ \mathrm{Invert}(m_1\_m_2); \\
&(m_a,\, m_a\_m_{12},\, m_a\_m_3) = \mathrm{Merge}(m_{12}, m_3, m_{12}\_m_3); \\
&(m_b,\, m_b\_m_{23},\, m_b\_m_1) = \mathrm{Merge}(m_{23}, m_1, m_{23}\_m_1); \\
&m_a\_m_b = (m_a\_m_{12} \circ m_{12}\_m_1 \circ \mathrm{Invert}(m_b\_m_1)) \;\oplus \\
&\qquad (m_a\_m_{12} \circ m_{12}\_m_2 \circ \mathrm{Invert}(m_{23}\_m_2) \circ \mathrm{Invert}(m_b\_m_{23})) \;\oplus \\
&\qquad (m_a\_m_3 \circ \mathrm{Invert}(m_{23}\_m_3) \circ \mathrm{Invert}(m_b\_m_{23})); \\
&\longrightarrow \\
&\mathrm{Invert}(m_a\_m_b) \circ m_a\_m_b = \mathrm{Id}(m_b); \\
&m_a\_m_b \circ \mathrm{Invert}(m_a\_m_b) = \mathrm{Id}(m_a); \quad \text{// i.e., } m_a\_m_b \text{ is a bijection}
\end{aligned}
\]
■

We made some initial progress on a simple theorem prover that uses 
a technique similar to “freezing” of (Ullman 1997) to test equivalence and 
entailment of scripts. Our prover was not able to find a contradiction to the 
above conjecture, but it is currently unable to provide a complete proof. 
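
The two identities that conclude the conjecture are a compact way of saying that a mapping is a bijection on instances. The following Python sketch checks them on finite toy relations; it assumes left-to-right composition and represents mappings as sets of instance pairs, and all names in it are illustrative.

```python
def invert(mapping):
    return {(y, x) for (x, y) in mapping}

def compose(left, right):
    """Left-to-right composition of two binary relations."""
    return {(x, z) for (x, y) in left for (y2, z) in right if y == y2}

def identity(model):
    return {(x, x) for x in model}

def is_bijection(mapping, left_model, right_model):
    """Check Invert(map) o map = Id(right) and map o Invert(map) = Id(left)."""
    return (compose(invert(mapping), mapping) == identity(right_model)
            and compose(mapping, invert(mapping)) == identity(left_model))

ma = {"x1", "x2"}
mb = {"y1", "y2"}
one_to_one = {("x1", "y1"), ("x2", "y2")}
lossy = {("x1", "y1"), ("x2", "y1")}   # two left instances collapse into one

ok = is_bijection(one_to_one, ma, mb)
not_ok = is_bijection(lossy, ma, mb)
```

The first identity fails when the mapping is not functional or not onto, the second when it is not total or not injective; together they leave exactly the bijections.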

11.3.3 Completeness and Redundancy 

Another vital question is that of completeness and redundancy: do we have 
a “complete” algebra with the operators Invert, Compose, Extract, Merge, 
Diff, and Confluence? What could be suitable completeness criteria? Are the
operators that we suggest non-redundant, i.e., is it true that none of the
operators can be expressed using a combination of others? Are our operators 
the best? What other operators are conceivable? 

We think that it may not be possible to characterize completeness other 
than by definition, similarly to the completeness of relational algebra. We do 
not know yet whether we succeeded in identifying all key operators, whether the
auxiliary operators such as Domain and Id should be considered part of the 
algebra, or whether more operators are needed. 

It may be desirable to introduce other fundamental operators into the 
algebra. For example, operator Hom(m) could return a mapping that esta- 
blishes a homomorphism relationship on instances of m. Such an operator 
could be used to characterize data exchange scenarios (Fagin et al. 2003).
Another useful operator could be an instance inclusion operator Incl(m). It
returns a mapping in which, say, left instances are entirely included in the
associated right instances. This operator could be used for characterizing the
certain answers for queries (compare Sect. 10.1.2). Operators such as
Hom and Incl cannot be defined in a language-independent fashion, but there
seems to be a good understanding of how to specify them precisely for each 
schema language of interest so that these operators may be of generic value. 

To illustrate some other possible operators, consider the operator Merge. 
We defined the semantics of this operator based on a data integration scenario 
in which a unified database needs to be constructed. Another important data 
integration scenario is the one where we construct a virtual view of several 
databases to give them the appearance of a single database. This scenario 
could lead to a different operator definition, which figuratively speaking inte- 
grates only the overlapping part of two databases, whereas Merge integrates 
all information. It seems possible to define this operator as a derived operator 
Intersect (see illustration in Fig. 11.2): 




Fig. 11.2. Illustration of Intersect operator 



Definition 11.3.1 (Intersect). $(m, m\_p, m\_q) = \mathrm{Intersect}(p, q, p\_q)$ if and
only if the following script holds:

\[
\begin{aligned}
&(p_x,\, p\_p_x) = \mathrm{Extract}(p, p\_q); \\
&(q_x,\, q\_q_x) = \mathrm{Extract}(q, \mathrm{Invert}(p\_q)); \\
&(m,\, m\_p_x,\, m\_q_x) = \mathrm{Merge}(p_x, q_x, \mathrm{Invert}(p\_p_x) \circ p\_q \circ q\_q_x); \\
&m\_p = m\_p_x \circ \mathrm{Invert}(p\_p_x); \\
&m\_q = m\_q_x \circ \mathrm{Invert}(q\_q_x);
\end{aligned}
\]
■

Whether the above definition describes the intended semantics correctly 
or not is however an open question. 

As another example, consider the operator Extract. In Definition 4.2.3, 
the operator takes a single mapping as input. In a more general setting, 
we may be interested in extracting a view that allows us to answer a set 
of given queries $q_1, \dots, q_n$ rather than a single query (compare Sect. 10.4).
That is, it must be possible to reformulate each of the input queries against 
the view schema. In other words, condition (ii) of Definition 4.2.3 has to be 
stated for each of the input queries. The question is though, whether the 
definition of Extract needs to be extended or whether it is possible to express 
the desired semantics using a script. Again, we do not have an answer to this 
question. However, we think that we do not need to extend the operator and 
we postulate the following hypothesis to be verified: 

Conjecture 11.3.2 (Extract for two queries). Let $q_1$ and $q_2$ be two queries
over $m$, i.e., $q_i \subseteq m \times s_i$. Further, let Definition 4.2.3 of Extract be extended
for two mappings,

\[
(m_x,\, m\_m_x) = \mathrm{Extract}(m, q_1, q_2);
\]

such that

\[
\begin{aligned}
&m\_m_x \circ \mathrm{Invert}(m\_m_x) \circ q_1 = q_1; \\
&m\_m_x \circ \mathrm{Invert}(m\_m_x) \circ q_2 = q_2;
\end{aligned}
\]

and minimality of $m_x$ is guaranteed. Then, the script

\[
\begin{aligned}
&(s,\, s\_s_1,\, s\_s_2) = \mathrm{Merge}(s_1, s_2, \mathrm{Invert}(q_1) \circ q_2); \\
&(m_x,\, m\_m_x) = \mathrm{Extract}(m, (q_1 \circ \mathrm{Invert}(s\_s_1)) \oplus (q_2 \circ \mathrm{Invert}(s\_s_2)));
\end{aligned}
\]

has the same effect on $m_x$ and $m\_m_x$ as the application of the extended
operator $\mathrm{Extract}(m, q_1, q_2)$. ■
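
The preservation condition used in the conjecture, namely that composing a view mapping with its inverse and then with a query yields the query again, can be tried out on toy data. The Python sketch below is an illustration under assumed encodings: instances of m are (name, age) tuples, queries and view mappings are sets of pairs, and composition is read left to right. It checks only the preservation condition, not the minimality of the view.

```python
def invert(mapping):
    return {(y, x) for (x, y) in mapping}

def compose(left, right):
    """Left-to-right composition of two binary relations."""
    return {(x, z) for (x, y) in left for (y2, z) in right if y == y2}

def answers_preserved(m_to_view, query):
    """Check m_m_x o Invert(m_m_x) o q = q for one query q."""
    round_trip = compose(m_to_view, invert(m_to_view))
    return compose(round_trip, query) == query

# Hypothetical instances of m.
m = {("ann", 30), ("ann", 31), ("bob", 30)}

q1 = {(inst, inst[0]) for inst in m}        # query returning the name
q2 = {(inst, inst[1]) for inst in m}        # query returning the age

keep_name = {(inst, inst[0]) for inst in m} # view that retains only the name
keep_both = {(inst, inst) for inst in m}    # view that retains everything

name_view_q1 = answers_preserved(keep_name, q1)
name_view_q2 = answers_preserved(keep_name, q2)
full_view_ok = (answers_preserved(keep_both, q1)
                and answers_preserved(keep_both, q2))
```

A view that keeps only the name answers the first query but not the second, because instances that agree on the name may differ in age. A view serving both queries therefore has to retain both pieces of information.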

In the conjecture we exploit the intuition from the view selection problem 
in data warehousing that the view m x can be computed using a so-called 
multiquery (Theodoratos et al. 2001), which corresponds to the expression 
$(q_1 \circ \mathrm{Invert}(s\_s_1)) \oplus (q_2 \circ \mathrm{Invert}(s\_s_2))$. Speaking informally, a multiquery
combines several queries into one. In fact, if queries are represented as graphs, 
a multiquery can be obtained by “merging” the individual query graphs. In 
this process, the view schema induced by the queries changes. The schema 
induced by the multigraph corresponds to the schema s in the script. 

We verified the conjecture in a preliminary form using our simple auto- 
mated prover. The prover could not find a contradiction for the implication 

\[
\begin{aligned}
&(s,\, s\_s_1,\, s\_s_2) = \mathrm{Merge}(s_1, s_2, \mathrm{Invert}(q_1) \circ q_2); \\
&(m_x,\, m\_m_x) = \mathrm{Extract}(m, (q_1 \circ \mathrm{Invert}(s\_s_1)) \oplus (q_2 \circ \mathrm{Invert}(s\_s_2))); \\
&\longrightarrow \\
&m\_m_x \circ \mathrm{Invert}(m\_m_x) \circ q_1 = q_1; \\
&m\_m_x \circ \mathrm{Invert}(m\_m_x) \circ q_2 = q_2;
\end{aligned}
\]

for arbitrary mappings $q_1$ and $q_2$, not only functions. Moreover, replacing
$\mathrm{Invert}(q_1) \circ q_2$ by $s_1 \times s_2$ in the premise did not cause the prover to find a
contradiction either, so that the conjecture may hold even if we simply take
a union of signatures of $s_1$ and $s_2$: $(s, s\_s_1, s\_s_2) = \mathrm{Merge}(s_1, s_2, s_1 \times s_2)$
(compare Theorem 4.2.4).

No matter whether or not the conjectures that we presented hold, there 
may be other fundamental model management scenarios that cannot be ex- 
pressed using a combination of the operators that we defined in this disser- 
tation. 

Notice that we stated Conjecture 11.3.2 for two queries and not for n 
queries. In fact, it turns out that generalizing the conjecture for n > 2 using 
binary mappings is not that easy. That leads us to a more general question, 
whether binary mappings are sufficient to address all model management 
scenarios of interest. 

11.3.4 N-ary Mappings 

An elegant and intriguing extension of the formalization that we presented is 
obtained by considering n-ary mappings, such as $map \subseteq m_1 \times m_2 \times \dots \times m_n$.
(For $n = 1$, we call a mapping a model.) To motivate n-ary mappings, consider
the following example. Imagine that we are given three models $m_1$, $m_2$, $m_3$,
each with a single class definition: class $A$ in $m_1$, $B$ in $m_2$, $C$ in $m_3$. Now
we want to establish the fact that $C = A \cup B$ (e.g., to subsequently merge
all three models). This is impossible to do if the mappings are limited to
two models at a time: we can state that $A$ is a subclass of $C$, and $B$ is
a subclass of $C$, but not the condition we want. The desired relationship
can be specified using a ternary mapping $map \subseteq m_1 \times m_2 \times m_3$,
$map =$ «$C = A \cup B$». Analogously, we can argue for the need of mappings of higher
arity by examining the condition $A_n = A_1 \cup A_2 \cup \dots \cup A_{n-1}$ such that each
of the $A_i$ is defined in a different model $m_i$.
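
The argument can be checked mechanically on a small universe. In the Python sketch below, an instance of each model is assumed to be simply the extent of its single class (a set of object identifiers); this encoding and all names are assumptions made for illustration. The sketch builds the ternary mapping for the condition that C is the union of A and B, takes all of its binary projections, and shows that the projections together admit triples that violate the condition.

```python
from itertools import combinations

def powerset(universe):
    """All subsets of the universe, as frozensets."""
    items = sorted(universe)
    return [frozenset(c)
            for r in range(len(items) + 1)
            for c in combinations(items, r)]

universe = {1, 2, 3}
states = powerset(universe)    # possible extents of one class

# Ternary mapping: instance triples (a, b, c) with c = a union b.
ternary = {(a, b, c)
           for a in states for b in states for c in states
           if c == (a | b)}

# The three binary mappings obtained by projecting the ternary one.
a_c = {(a, c) for (a, _, c) in ternary}    # amounts to "A is a subclass of C"
b_c = {(b, c) for (_, b, c) in ternary}    # amounts to "B is a subclass of C"
a_b = {(a, b) for (a, b, _) in ternary}    # no restriction at all

# Everything the binary mappings can enforce when taken together.
rebuilt = {(a, b, c)
           for a in states for b in states for c in states
           if (a, c) in a_c and (b, c) in b_c and (a, b) in a_b}

# A triple accepted by all binary mappings although C has an extra object.
witness = (frozenset({1}), frozenset({2}), frozenset({1, 2, 3}))
```

The set `rebuilt` strictly contains `ternary`, so no combination of the binary mappings pins down the union condition in this example.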

The fact that the relationship between two models can only be speci- 
fied using a third model, a so-called “helper” model, has been recognized in
(Madhavan et al. 2002). The definition of mappings that the authors suggest
can be viewed as a ternary relationship on model instances. A similar argu- 
ment for ternary mappings was presented in (Pottinger and Bernstein 2003) 
in the context of the Merge operator. The database transformation langu- 
age WOL can be used to express constraints that span multiple databases 
(Davidson et al. 1995a), just as the languages utilized for answering queries 
using views (Halevy 2001). In (Kementsietsidis et al. 2003), n-ary mappings 
between peer-to-peer sources were considered. Therefore, we think that n-ary
mappings provide a practically important generalization of the theory that 
we presented. 

All operators that we discussed can be generalized for n-ary mappings. 
For example, the operator Compose becomes very similar to the relational 
equijoin operator, except that the join is performed on entire database states 
rather than attribute values. For example, for $map_1 \subseteq m_1 \times m_2 \times m_3$ and
$map_2 \subseteq m_1 \times m_3 \times m_4$, we write $map_1 \circ_{m_1, m_3} map_2 \subseteq m_2 \times m_4$.
Operators Domain and Range can be generalized as the operator $\pi$, which is
similar to the relational projection operator: $\pi_{m_1, m_3}(map_1) \subseteq m_1 \times m_3$.
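
A minimal Python sketch of these two generalized operators follows. It assumes that an n-ary mapping is stored as a pair of a tuple of model names and a set of equally long tuples of instance identifiers; this representation, the function names, and the toy data are illustrative assumptions.

```python
def project(mapping, keep):
    """Generalized Domain/Range: restrict a mapping to the named positions."""
    names, rows = mapping
    idx = [names.index(n) for n in keep]
    return (tuple(keep), {tuple(row[i] for i in idx) for row in rows})

def compose(map1, map2, on):
    """Generalized Compose: join on the shared positions named in `on`
    and drop those positions from the result."""
    names1, rows1 = map1
    names2, rows2 = map2
    i1 = [names1.index(n) for n in on]
    i2 = [names2.index(n) for n in on]
    rest1 = [i for i, n in enumerate(names1) if n not in on]
    rest2 = [i for i, n in enumerate(names2) if n not in on]
    out_names = (tuple(names1[i] for i in rest1)
                 + tuple(names2[i] for i in rest2))
    out_rows = {
        tuple(r1[i] for i in rest1) + tuple(r2[i] for i in rest2)
        for r1 in rows1 for r2 in rows2
        if all(r1[a] == r2[b] for a, b in zip(i1, i2))
    }
    return (out_names, out_rows)

# Each string stands for one entire database state of the named model.
map1 = (("m1", "m2", "m3"), {("s1", "t1", "u1"), ("s2", "t2", "u1")})
map2 = (("m1", "m3", "m4"), {("s1", "u1", "v1"), ("s2", "u2", "v2")})

joined = compose(map1, map2, on=("m1", "m3"))
projected = project(map1, ("m1", "m3"))
```

Because the join positions are named explicitly in each call, no separate inversion step is needed to line up the two mappings.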

Each n-ary mapping $map$ can be viewed as a binary mapping between a $k$-ary
and an $(n-k)$-ary mapping, i.e., $map \subseteq (m_1 \times \dots \times m_k) \times (m_{k+1} \times \dots \times m_n)$,
$1 \le k < n$. In this way, we can adapt the definitions of the operators Extract
and Diff of Sect. 4.2 with very little change. Both operators yield mappings
of a smaller dimensionality for a given mapping, e.g., allow us to get $m_1\_m_2$
from $m_1\_m_2\_m_3$. The grouping of an n-ary mapping into such quasi-binary
mappings can be done in various ways. The operator Invert becomes obsolete, 
since the mapping positions that participate in composition, extraction, etc. 
need to be specified explicitly in each operator anyway. 

The operator Match in general returns an n-ary mapping to reflect the
fact that we may need one or more helper models to relate $k$ given input
models, $k < n$. For example, for $k = 2$, $n = 3$ we can write
$m_1\_m_2\_h = \mathrm{Match}(m_1, m_2)$. Model $h$ and mappings $m_1\_h$, $m_2\_h$, and $m_1\_m_2$ are im-
plicitly contained in $m_1\_m_2\_h$ and can be obtained using the operator $\pi$.
To merge $n$ models, we write $\mathrm{Merge}(m_1, m_2, \dots, m_n, map)$, where $map$ is an
n-ary mapping. Extending the Merge operator for n-ary mappings and stu- 
dying its properties may help analyze the associativity of Merge for binary 
mappings (see Conjecture 11.3.1). 

n-ary mappings can be used to characterize dynamic scenarios, such as 
mediation between distributed services. For example, consider that answering 
a query $q$ requires first consulting a data source $m_1$, then formulating a
query against $m_2$ using the data obtained from $m_1$, and finally combining
the results for the final answer. This mediation scenario can be characterized
by a ternary mapping $q \subseteq m_1 \times m_2 \times r$, where $r$ is the result schema for
$q$. Specific execution strategies, such as caching subsets of results from $m_1$
and using them for later query processing are abstracted out in the ternary 
mapping. 

11.3.5 Formalization of Model-Management Problems

A set of high-level operators with well-understood state-based semantics may 
be instrumental for finding agreement on a number of long-standing model- 
management problems and scenarios that have traditionally been addressed 
using heuristic or intuitive approaches. Data integration and schema evolu- 
tion are two prominent examples of such problems. We suggested that the 
two predominant kinds of data integration, database and view integration
(Davidson et al. 1995a), can be described formally using the operator Merge 
(Sect. 4.2.4) and the operator Intersect (Sect. 11.3.3). We proposed a formal 
specification of schema evolution in Sect. 5.4. However, our work makes only 
a first step in understanding these scenarios precisely. 

Many more important scenarios are outstanding. Examples include data 
exchange, mediation, or answering queries in a data integration setting. Cha- 
racterizing these scenarios using model-management scripts is a promising 
direction for future research. Such scripts provide implementation guidelines 
for system developers and do so independently of the concrete schema and 
mapping languages deployed by the developers. The scripts can be used as 
formal specifications for driving customized solutions, i.e., they can be valua- 
ble even without a generic model-management system that executes them. 



A. User Study: Gathering Intended Match 
Results 



The user study was handed out to nine members of the Stanford Database Group
in February 2001. The task specifications have been reformatted to fit the page 
size used in the dissertation. The tables for entering the answers are omitted 
for brevity. 

This user study attempts to collect various intended match results for a set 
of schema matching problems. General remarks: 

1. The information provided about the source and target schemas is inten- 
tionally vague. Imagine a plausible scenario and try to map elements in 
both schemas according to the scenario you have in mind. 

2. You don’t have to match every element on the left and every one on the 
right, partial mappings are fine (if consistent with the scenario you have 
in mind). 

3. m:n correspondences between schema elements are welcome.

4. No mapping expressions are required. 

5. The elements in the left and right schemas are numbered. Please fill out 
the table following every problem as shown in the example below. 

When you are finished, please return your results to my office (438). 
Thanks a lot! 



Example

This example shows schematically two XML schemas.

  Left schema        Right schema
  1  Cust            a  Customer
  2  C#              b  CustID
  3  CName           c  Company
  4  FirstName       d  Contact
  5  LastName        e  Phone

Many possible match results are conceivable for the schemas. Two of them
are depicted below:

  Left element(s)    Right element(s)
  1                  a
  2                  b
  3                  c
  4,5                d

  Left element(s)    Right element(s)
  2                  b
  3                  c,d



A.1 BizTalk Schemas (XML)

Left schema.

 1 <Schema name="Schema 1"
       xmlns="urn:microsoft-com:xml-data">
 2   <ElementType name="AccountOwner">
 3     <element type="Name"/>
 4     <element type="Address"/>
 5     <element type="Birthdate"/>
 6     <element type="TaxExempt"/>
     </ElementType>
 7   <ElementType name="Address">
 8     <element type="Street"/>
 9     <element type="City"/>
10     <element type="State"/>
11     <element type="ZIP"/>
     </ElementType>
   </Schema>

Right schema.

 a <Schema name="Schema 2"
       xmlns="urn:microsoft-com:xml-data">
 b   <ElementType name="Customer">
 c     <element type="Cname"/>
 d     <element type="CAddress"/>
 e     <element type="CPhone"/>
     </ElementType>
 f   <ElementType name="CustomerAddress">
 g     <element type="Street"/>
 h     <element type="City"/>
 i     <element type="USState"/>
 j     <element type="PostalCode"/>
     </ElementType>
   </Schema>
A.2 Property Listing Schemas (XML)

  Left schema          Right schema
   1  HOUSE            a  listing
   2  ADDRESS          b  location
   3  COUNTY           c  area
   4  PRICE            d  price
   5  DESCRIPTION      e  comments
   6  CONTACT-INFO     f  contact
   7  OFFICE-INFO      g  agent
   8  OFFICE-NAME      h  name
   9  OFFICE-PHONE     i  office
  10  AGENT-INFO       j  brokerage
  11  AGENT-NAME       k  name
  12  AGENT-PHONE      l  phone
                       m  house-style



A.3 Library Schemas (XML)

Left schema.

 1 <E n="Library">
 2   <E n="Item">
 3     <e n="ISBN"/>
 4     <e n="Author"/>
 5     <e n="Title"/>
 6     <e n="Year"/>
     </E>
 7   <E n="Author">
 8     <e n="FirstName"/>
 9     <e n="LastName"/>
     </E>
10   <E n="BorrowedItems">
11     <e n="Item"/>
12     <e n="Borrower"/>
     </E>
13   <E n="Borrower">
14     <e n="FirstName"/>
15     <e n="LastName"/>
     </E>
   </E>

Right schema.

 a <E n="Collection">
 b   <E n="Document">
 c     <e n="Identifier"/>
 d     <e n="Creator"/>
 e     <e n="Contributor"/>
 f     <e n="Publisher"/>
 g     <e n="Title"/>
 h     <e n="Year"/>
     </E>
 i   <E n="Creator">
 j     <e n="Name"/>
     </E>
 k   <E n="Name">
 l     <e n="first"/>
 m     <e n="last"/>
     </E>
 n   <E n="Publisher">
 o     <e n="Address"/>
 p     <e n="Name"/>
     </E>
   </E>



A.4 Product Schemas with Data Instances (XML)

In this problem, XML tags in both schemas need to be matched given two 
instances of schemas. The numbering enumerates all different tags on the left 
and on the right. Remember, you are matching the tag names, the particular 
instance values provide the hints for the matching process. 
Left schema.

 1 <amazon>
 2   <item>
 3     <title>Sony DCR-PC100
              Digital HandyCam Camcorder</title>
 5     <listPrice>1899.99</listPrice>
 6     <ourPrice>1699.00</ourPrice>
 7     <youSave>200.00</youSave>
 8     <review>
 9       <avgReview>4.5</avgReview>
10       <numOfReviews>20</numOfReviews>
       </review>
11     <availability>On Order;
              usually ships within 1-2
              weeks</availability>
12     <features>
13       <zoom>10x optical zoom</zoom>
         <zoom>120x digital zoom</zoom>
14       <lcd>2.5inch LCD</lcd>
15       <other>4 MB Memory Stick included</other>
       </features>
     </item>
   </amazon>

Right schema.

 a <yahoo>
 b   <productInfo>
 c     <id>Sony DCR-PC100</id>
 d     <merchantPrice>1799.94</merchantPrice>
 e     <rating>
 f       <userRating>3.5</userRating>
 g       <userReviews>7</userReviews>
       </rating>
 h     <description>
 i       <LCDScreenSize>2.5in</LCDScreenSize>
 j       <opticalZoom>10x</opticalZoom>
 k       <special>4MB Memory Stick</special>
       </description>
     </productInfo>
   </yahoo>



A.5 University Schemas with Data Instances (XML)

Same problem as the previous one: XML tags in both schemas need to be 
matched given two instances of schemas. The numbering enumerates all dif- 
ferent tags on the left and on the right. Remember, you are matching the 
tag names, the particular instance values provide the hints for the matching 
process. 
Left schema.

 1 <db1>
 2   <Faculty>
 3     <SSN>234-56-7890</SSN>
 4     <Facu_Name>Richie Solomon</Facu_Name>
 5     <Salary>170000</Salary>
     </Faculty>
 6   <Student>
 7     <Stud_ID>7206362</Stud_ID>
 8     <Stud_Name>Teresa Lista</Stud_Name>
 9     <Stipend>23000</Stipend>
10     <Tel>408-973 0110</Tel>
     </Student>
   </db1>

Right schema.

 a <db2>
 b   <Personnel>
 c     <ID>234-56-7890</ID>
 d     <Name>Solomon, Richie</Name>
 e     <Address>Sand Hill Road, Menlo Park, CA</Address>
 f     <W_phone>(408) 495 8423</W_phone>
 g     <H_phone>(650) 923 4193</H_phone>
     </Personnel>
     <Personnel>
       <ID>7206362</ID>
       <Name>Lista, Teresa</Name>
       <Address>Cotton St, Palo Alto, CA</Address>
       <W_phone>(408) 973 0110</W_phone>
       <H_phone>(650) 198 2424</H_phone>
     </Personnel>
   </db2>



A.6 Catalogs with Data Instances (XML)

In this problem, catalog entries in both schemas need to be matched given 
two instances of schemas. The numbering enumerates all different catalog 
categories on the left and on the right. 

Left schema. 

<yahoo> 

1 <cat id="Home"> 

2 <cat id="Electronics and Photography"> 

3 <cat id="Television and Video"> 

4 <cat id="Camcorders"> 

5 <cat id="By Format - DV">

<product name="SONY DCR-PC100"/>

</cat>

</cat>

</cat>

6 <cat id="Photography">

7 <cat id="Brands - Polaroid">

<product name="POLAROID PDC 3000"/>
</cat>

</cat>

</cat>

8 <cat id="Movies">

9 <cat id="Comedy">

10 <cat id="Parody">

11 <cat id="Science Fiction">

<product name="Mars Attacks!"/>

</cat>

</cat>

12 <cat id="Satire">

<product name="The Graduate"/>
</cat>

</cat>

</cat>

</cat>

</yahoo> 

Right schema. 

<epinions> 
a <cat id="Home"> 
b <cat id="Electronics"> 
c <cat id="Video"> 

d <cat id="Camcorders"> 

<product name="Sony DCR-PC100"/> 
</cat> 

</cat>

e <cat id="Photo">

f <cat id="Cameras">

<product name="Minolta Maxxum 9"/>
</cat>

</cat>

</cat>

g <cat id="Arts and Entertainment">

h <cat id="Movies">
i <cat id="Video">

<product name="The Graduate"/>
<product name="Mars Attacks!"/>
</cat>

</cat>

</cat>

</cat>

</epinions> 
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A. 7 Personnel Schemas (Relational) 



The numbering enumerates tables and columns in both schemas. 



Left schema.

1 CREATE TABLE Personnel (
2 Pno int,
3 Pname string,
4 Dept string,
5 Born date,
UNIQUE perskey (Pno)
);

Right schema.

a CREATE TABLE Employee (
b EmpNo int PRIMARY KEY,
c EmpName varchar(50),
d DeptNo int REFERENCES Department,
e Salary dec(15,2),
f Birthdate date
);

g CREATE TABLE Department (
h DeptNo int PRIMARY KEY,
i DeptName varchar(70)
);



A.8 University Schemas (Relational)

The schema on the right is a previous version of the schema shown on the
left; the left schema is the evolved one. Match the new version of the
schema onto the old one!

Left schema. 

1 CREATE TABLE Address ( 

2 Id int PRIMARY KEY, 

3 Street string, 

4 City string, 

5 PostalCode int 

); 

6 CREATE TABLE Professor (

7 Id int PRIMARY KEY,

8 Name string, # name

9 Sal double, # salary

10 addr int # address

);

11 CREATE TABLE Student (

12 Name string, # name

13 GPA double, # grade point avg

14 Yr int # year of studies

);



15 CREATE TABLE PayRate ( 

16 Rank int PRIMARY KEY, # project rank 

17 HrRate double # hourly pay rate 



); 
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18 CREATE TABLE WorksOn ( 

19 Name string, # name of student

20 Proj string, # project name

21 Hrs int, # hours spent

22 ProjRank int # project rank

);

Right schema.

a CREATE TABLE Professor (
b Id int PRIMARY KEY,

c Name string,

d Salary double,

e Address string

);

f CREATE TABLE Student (
g Name string,

h GradePointAverage double,
i Year int

);

j CREATE TABLE WorksOn (
k StudentName string,

l Project string,

m Expenses double

);



A.9 Personnel/University Schemas (Relational)

The left schema is the same as in the previous example. It deals with
professors and students, and provides information about who worked on which
project for how long. Moreover, the schema contains information about the
payment of professors and students.

The right schema captures general personnel information:

a CREATE TABLE Personnel (
b Id int PRIMARY KEY,
c Name string,

d Sal double, # salary

e Addr string # address

);

Hints for interpreting the schemas:

- WorksOn(ProjRank) may or may not be a foreign key for PayRate(Rank)

- WorksOn(Name) may or may not be a foreign key for Professor(Name) or
Student(Name)

- Student(Yr) may or may not be a foreign key for PayRate(Rank)

- The pay rate of a student may or may not depend on the year and/or his/her
grades



B. Proofs of Simplification Theorems 



In this appendix, we prove Theorems 4.2.1, 4.2.3, and 4.2.5, which provide
simplified characterizations of the operators Extract, Merge, and Diff,
respectively. For convenience, we repeat the definition of the equivalence
relation $\mathit{ind}(\cdot, \cdot, m\_m')$:

$$\mathit{ind}(y_1, y_2, m\_m') =_{df} \left(\{z_1 \mid (y_1, z_1) \in m\_m'\} = \{z_2 \mid (y_2, z_2) \in m\_m'\}\right)$$

If $\mathit{ind}(y_1, y_2, m\_m')$, we say that $y_1$ and $y_2$ are indistinguishable under $m\_m'$.
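As an illustration of this definition, the following minimal Python sketch represents a mapping as a set of pairs and tests indistinguishability by comparing image sets. The representation and the helper names `images` and `ind` are assumptions made for this example only.

```python
def images(y, mapping):
    """Return the set of instances that y is connected to in the mapping."""
    return {z for (src, z) in mapping if src == y}

def ind(y1, y2, mapping):
    """True if y1 and y2 have identical image sets under the mapping."""
    return images(y1, mapping) == images(y2, mapping)

# Hypothetical mapping m_m': a and b reach {1, 2}, c reaches only {2}.
m_mprime = {("a", 1), ("a", 2), ("b", 1), ("b", 2), ("c", 2)}

print(ind("a", "b", m_mprime))  # True
print(ind("a", "c", m_mprime))  # False
```

Because the test is an equality of image sets, the relation is reflexive, symmetric, and transitive, which is what the proofs below rely on.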



B.1 Extract Operator

Theorem 4.2.1 (from page 69). Let $\mathrm{Domain}(m\_m') \subseteq m$.
$(m_x, m\_m_x) = \mathrm{Extract}(m, m\_m')$ holds if and only if the following
conditions are satisfied:

1. $m_x = \mathrm{Range}(m\_m_x)$.

2. $\mathrm{Domain}(m\_m_x) = \mathrm{Domain}(m\_m')$.

3. For all $(y_1, x_1), (y_2, x_2) \in m\_m_x$: $x_1 = x_2$ iff $\mathit{ind}(y_1, y_2, m\_m')$.

Condition (2) makes sure that exactly those instances of $m$ participate
in $m\_m_x$ that are connected in $m\_m'$. Condition (3) requires collapsing any
two instances $y_1$ and $y_2$ of $m$ into a single instance of $m_x$ if and only if
$y_1$ and $y_2$ are indistinguishable under $m\_m'$.
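One way to obtain a pair satisfying conditions (1)-(3) is to create one instance of the extracted model per class of indistinguishable instances. The Python sketch below does this for mappings given as sets of pairs; the function name `extract` and the generated instance names `x0`, `x1`, ... are hypothetical, and the code is only a sketch of the characterization, not the implementation used in the prototype.

```python
def extract(m, m_mprime):
    """Build (m_x, m_m_x) by collapsing indistinguishable instances of m."""
    domain = {y for (y, _) in m_mprime}
    assert domain <= set(m), "Domain(m_m') must be contained in m"
    classes = {}   # image set under m_m' -> instance of m_x
    m_m_x = set()
    for y in sorted(domain):
        key = frozenset(z for (src, z) in m_mprime if src == y)
        # Instances with equal image sets share one instance of m_x.
        x = classes.setdefault(key, "x%d" % len(classes))
        m_m_x.add((y, x))
    m_x = {x for (_, x) in m_m_x}   # condition (1): m_x = Range(m_m_x)
    return m_x, m_m_x

m = {"a", "b", "c", "d"}
m_mprime = {("a", 1), ("a", 2), ("b", 1), ("b", 2), ("c", 2)}
m_x, m_m_x = extract(m, m_mprime)
print(sorted(m_m_x))  # [('a', 'x0'), ('b', 'x0'), ('c', 'x1')]
```

Instance d does not participate in the input mapping, so it is absent from the result, in line with condition (2).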

Proof: First, we simplify condition (ii), i.e., the equality of the mappings
$m\_m_x \circ \mathrm{Invert}(m\_m_x) \circ m\_m'$ and $m\_m'$. Notice that for any two mappings
$map_1$ and $map_2$, $map_1 = map_2$ holds if and only if
$\mathrm{Domain}(map_1) = \mathrm{Domain}(map_2)$ and for each $x \in \mathrm{Domain}(map_1)$:
$\{y \mid (x, y) \in map_1\} = \{y \mid (x, y) \in map_2\}$.

In the composition $m\_m_x \circ \mathrm{Invert}(m\_m_x)$, the range of $m\_m_x$ is
identical with the domain of $\mathrm{Invert}(m\_m_x)$. Thus, the composition
does not drop instances from the domain of $m\_m_x \circ \mathrm{Invert}(m\_m_x)$.
Therefore, $\mathrm{Domain}(m\_m_x \circ \mathrm{Invert}(m\_m_x) \circ m\_m') = \mathrm{Domain}(m\_m')$ iff
$\mathrm{Domain}(m\_m_x) = \mathrm{Domain}(m\_m')$.

Let $y \in \mathrm{Domain}(m\_m_x)$. If we traverse $y$ over $m\_m_x$ to $m_x$ and
back, we obtain the set of round-tripped images
$Rt(y) = \{y' \mid (y, x) \in m\_m_x \text{ and } (y', x) \in m\_m_x\}$, with $y \in Rt(y)$.
Traversing from $y$ over $m\_m'$ directly must give us the identical set of
$m'$-images as by first round-tripping $y$ to $Rt(y)$ and traversing each
$y' \in Rt(y)$ over $m\_m'$. That is, for each $y \in \mathrm{Domain}(m\_m_x)$ and each
$y' \in Rt(y)$: $\{z \mid (y, z) \in m\_m'\} = \{z \mid (y', z) \in m\_m'\}$. In other words:
for all $y \in \mathrm{Domain}(m\_m_x)$, $y' \in Rt(y)$: $\mathit{ind}(y, y', m\_m')$. Now we expand
the definition of $Rt$ again and get an equivalent expression: for all
$y_1, y_2 \in \mathrm{Domain}(m\_m_x)$ with $(y_1, x), (y_2, x) \in m\_m_x$: $\mathit{ind}(y_1, y_2, m\_m')$.
This expression can be further simplified as: if $(y_1, x), (y_2, x) \in m\_m_x$,
then $\mathit{ind}(y_1, y_2, m\_m')$.

S. Melnik: Generic Model Management, LNCS 2967, pp. 221-228, 2004.

© Springer-Verlag Berlin Heidelberg 2004

That is, condition (ii) is equivalent to the conjunction:
$\mathrm{Domain}(m\_m_x) = \mathrm{Domain}(m\_m')$ and for all $(y_1, x), (y_2, x) \in m\_m_x$:
$\mathit{ind}(y_1, y_2, m\_m')$.

Now we turn to the actual proof. First we show that the conditions (1)-(3)
stated in Theorem 4.2.1 are necessary, i.e., they follow from Definition 4.2.3.
Then, we demonstrate that they are also sufficient.

($\rightarrow$) Let conditions (i)-(iii) hold. Conditions (1) and (2) are satisfied trivially.
To prove condition (3), let $(y_1, x), (y_2, x) \in m\_m_x$ with $y_1 \neq y_2$. Then,
$\mathit{ind}(y_1, y_2, m\_m')$ follows immediately from (ii). Now, let
$(y_1, x_1), (y_2, x_2) \in m\_m_x$ with $y_1 \neq y_2$ and $x_1 \neq x_2$. Assume that
$\mathit{ind}(y_1, y_2, m\_m')$ holds. Since $\mathit{ind}(\cdot, \cdot, \cdot)$ is transitive, then for all
$y_1, y_2$ with $(y_1, x_1), (y_2, x_2) \in m\_m_x$: $\mathit{ind}(y_1, y_2, m\_m')$. Observe that
condition (ii) remains true when we set $x_1 = x_2$. We can construct a smaller
model $m'_x = m_x - \{x_2\}$ and a mapping $m\_m'_x$, in which $x_2$ is substituted by
$x_1$. $m'_x$ and $m\_m'_x$ satisfy (i)-(ii), but $m_x \leq m'_x$ does not hold. This yields
a contradiction to (iii), so that our assumption is false and
$\mathit{ind}(y_1, y_2, m\_m')$ does not hold. We have shown that all conditions (1)-(3)
stated in Theorem 4.2.1 follow from Definition 4.2.3.

($\leftarrow$) Now we prove the reverse. Let conditions (1)-(3) hold. Trivially, if (1)
then (i). The conjunction of (2) and (3) is obviously more restrictive than
condition (ii), so (ii) holds. We show the minimality condition (iii) using
the following approach. First, we establish a lower bound on the number of
instances that $m_x$ must have as $|m_x| \geq k$ using conditions (i) and (ii). Then,
we show that if (1)-(3) are satisfied, then $m_x$ necessarily has $k$ instances,
so it is a minimal model with (i)-(ii).

The lower bound is established by condition (ii). Recall that $\mathit{ind}(\cdot, \cdot, m\_m')$
is an equivalence relation. It yields a disjoint decomposition $\Pi$ of instances
in $\mathrm{Domain}(m\_m')$, i.e., of all instances of $m$ that are connected in $m\_m'$.
By condition (ii), each equivalence class $c \in \Pi$ must be associated with a
distinct instance in $m_x$. The proof is by contradiction: let $c_1, c_2 \in \Pi$,
$y_1 \in c_1$, $y_2 \in c_2$, and $(y_1, x), (y_2, x) \in m\_m_x$, i.e., instance $x$ is shared
among $c_1$ and $c_2$. Then, by condition (ii), $\mathit{ind}(y_1, y_2, m\_m')$ holds and thus
$c_1$, $c_2$ are not disjoint - we obtained a contradiction. That is,
$|m_x| \geq k = |\Pi|$.
Now, we demonstrate that conditions (1)-(3) imply $|m_x| = |\Pi|$. By
condition (3), if $(y_1, x_1), (y_2, x_2) \in m\_m_x$ and $\mathit{ind}(y_1, y_2, m\_m')$, then
$x_1 = x_2$. That is, all instances from the same equivalence class $c \in \Pi$
must be associated with an identical instance $x \in m_x$. By condition (2),
$\mathrm{Domain}(m\_m_x) = \mathrm{Domain}(m\_m')$, i.e., each instance of $m_x$ is associated
with some $y \in c$, $c \in \Pi$. Hence, $|m_x| \leq |\Pi|$. We already have $|m_x| \geq |\Pi|$,
since (i)-(iii) imply (1)-(3). Therefore, $|m_x| = |\Pi|$, and (i)-(iii) from
Definition 4.2.3 follow from (1)-(3) of Theorem 4.2.1.



B.2 Merge Operator 

Lemma B.2.1. Let $m_1\_m_2 \subseteq m_1 \times m_2$. $(m, m\_m_1, m\_m_2) =
\mathrm{Merge}(m_1, m_2, m_1\_m_2)$ holds if and only if the conditions (i)-(iii) of
Definition 4.2.4 are satisfied and

- $S_1 = m - \mathrm{Domain}(m\_m_2) = m_1 - \mathrm{Domain}(m_1\_m_2)$,

- $S_2 = m - \mathrm{Domain}(m\_m_1) = m_2 - \mathrm{Range}(m_1\_m_2)$,

- $S_3 = \mathrm{Domain}(m\_m_1) \cap \mathrm{Domain}(m\_m_2) = m_1\_m_2$

for disjoint partitions $S_1, S_2, S_3$ of $m$, $S_1 \cup S_2 \cup S_3 = m$. ■

Proof: We show that each model with (i)-(iv) can be partitioned as suggested
in the proposition. The partitioning determines $m$ uniquely up to
isomorphism for fixed $m_1$, $m_2$, and $m_1\_m_2$. This implies that each model $m'$
with (i)-(iii) that is isomorphic to $m$ is minimal and so satisfies (iv).

Let conditions (i)-(iv) hold. We partition $m$ into four sets, $S_0, S_1, S_2, S_3$:

- $S_0 = \{z \mid z \in m \text{ and } z \notin \mathrm{Domain}(m\_m_1) \text{ and } z \notin \mathrm{Domain}(m\_m_2)\}$.

- $S_1 = \{z \mid z \in m \text{ and } z \in \mathrm{Domain}(m\_m_1) \text{ and } z \notin \mathrm{Domain}(m\_m_2)\}$.

- $S_2 = \{z \mid z \in m \text{ and } z \notin \mathrm{Domain}(m\_m_1) \text{ and } z \in \mathrm{Domain}(m\_m_2)\}$.

- $S_3 = \{z \mid z \in m \text{ and } z \in \mathrm{Domain}(m\_m_1) \text{ and } z \in \mathrm{Domain}(m\_m_2)\}$.

By construction, $S_0, S_1, S_2, S_3$ are pairwise disjoint with
$S_0 \cup S_1 \cup S_2 \cup S_3 = m$. First, we show that $S_0 = \emptyset$. Let $z \in m$ and
$z \notin \mathrm{Domain}(m\_m_1)$ and $z \notin \mathrm{Domain}(m\_m_2)$. Then, obviously the model
$m' = m - \{z\}$ with the same $m\_m_1$, $m\_m_2$ satisfies (i)-(iii) and thus (iv) is
violated. By contradiction, there is no $z$ with these properties, and
$S_0 = \emptyset$.

Due to condition (iii), $\mathrm{Domain}(m\_m_1) \subseteq m$ and $\mathrm{Domain}(m\_m_2) \subseteq m$.
Therefore, we can simplify the definitions of $S_1$, $S_2$ and $S_3$ as follows:

- $S_1 = m - \mathrm{Domain}(m\_m_2)$

- $S_2 = m - \mathrm{Domain}(m\_m_1)$

- $S_3 = \mathrm{Domain}(m\_m_1) \cap \mathrm{Domain}(m\_m_2)$.

Let $x \in m_1 - \mathrm{Domain}(m_1\_m_2)$. Then, $x \in \mathrm{Range}(m\_m_1)$ and there exists
$z$ with $(z, x) \in m\_m_1$. Since $\mathrm{Domain}(m\_m_1)$ is fully contained in $m$ by
condition (iii), $z \in m$. Assume that there exists $y$ with $(z, y) \in m\_m_2$.
Then, by (ii), $(x, y) \in m_1\_m_2$ and $x \in \mathrm{Domain}(m_1\_m_2)$. We arrived at a
contradiction. Hence, our assumption was false and $z \notin \mathrm{Domain}(m\_m_2)$.






Now, assume that there exists $z' \neq z$ with $(z', x) \in m\_m_1$. We construct
$m' = m - \{z'\}$ and $m'\_m_1 = m\_m_1 - \{(z', x)\}$. Since $z' \notin \mathrm{Domain}(m\_m_2)$ and
$x \notin \mathrm{Domain}(m_1\_m_2)$, removing $(z', x)$ from $m\_m_1$ preserves condition (ii).
Conditions (i) and (iii) are satisfied trivially. Hence, we obtained a smaller
model $m'$. By contradiction to (iv), we conclude that $z$ is determined
uniquely. That is, there is a function from $m_1 - \mathrm{Domain}(m_1\_m_2)$ into $m$. But
since $m\_m_1$ is a surjective function, there is a bijection $f_1 \subseteq m_1\_m$
between $m_1 - \mathrm{Domain}(m_1\_m_2)$ and $S_1 = m - \mathrm{Domain}(m\_m_2)$.

Analogously, we show that there is a bijection $f_2 \subseteq m_2\_m$ between
$m_2 - \mathrm{Range}(m_1\_m_2)$ and $S_2 = m - \mathrm{Domain}(m\_m_1)$.

Finally, we demonstrate that $S_3 = m_1\_m_2$. Let $(x, y) \in m_1\_m_2$. By
condition (ii) there exists $z \in \mathrm{Domain}(m\_m_1)$ with $(z, x) \in m\_m_1$ and
$(z, y) \in m\_m_2$. By (iii), $z \in m$. Now, let $(z_1, x), (z_2, x) \in m\_m_1$, and
$(z_1, y), (z_2, y) \in m\_m_2$. Assume that $z_1 \neq z_2$. We construct $m' = m - \{z_2\}$,
$m'\_m_1 = m\_m_1 - \{(z_2, x)\}$, and $m'\_m_2 = m\_m_2 - \{(z_2, y)\}$. By condition (i),
there is no $x' \neq x$ with $(z_1, x') \in m'\_m_1$ or $(z_2, x') \in m'\_m_1$, and there is no
$y' \neq y$ with $(z_1, y') \in m'\_m_2$ or $(z_2, y') \in m'\_m_2$. Thus, $m'$, $m'\_m_1$, $m'\_m_2$
satisfy (i)-(iii). Hence, we found a smaller model $m'$ that satisfies (i)-(iii).
This is a contradiction to (iv). Therefore, our assumption is false and
$z_1 = z_2$, i.e., $(x, y)$ determine $z \in m$ uniquely. Now, let $z \in S_3$. By
condition (iii), there exists $(x, y) \in m_1\_m_2$ with $(z, x) \in m\_m_1$ and
$(z, y) \in m\_m_2$. $x$ and $y$ are determined uniquely by condition (i). That is,
there is a bijection $g$ between $m_1\_m_2$ and $S_3$. Hence, $S_3 = m_1\_m_2$. ■

Lemma B.2.1 implies that the output model $m$ in Merge is determined
up to isomorphism. Thus, we can further simplify the definition of Merge as
follows.

Theorem 4.2.3 (from page 74). Let $m_1\_m_2 \subseteq m_1 \times m_2$. $(m, m\_m_1, m\_m_2)
= \mathrm{Merge}(m_1, m_2, m_1\_m_2)$ holds if and only if

- the conditions (i)-(iii) of Definition 4.2.4 are satisfied, and

- $|m| = \mathrm{mergeCard}(m_1, m_2, m_1\_m_2)$, where $\mathrm{mergeCard}(m_1, m_2, m_1\_m_2) =_{df}
|m_1\_m_2| + |m_1 - \mathrm{Domain}(m_1\_m_2)| + |m_2 - \mathrm{Range}(m_1\_m_2)|$.

If $m_1\_m_2$ is total and surjective, or if $m\_m_1$ and $m\_m_2$ are total, then
$\mathrm{mergeCard}(m_1, m_2, m_1\_m_2) = |m_1\_m_2|$.
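The cardinality formula is easy to evaluate on concrete inputs. The following Python sketch assumes models are finite sets and the mapping is a set of pairs; the function name `merge_card` is an illustrative choice.

```python
def merge_card(m1, m2, m1_m2):
    """|m1_m2| + |m1 - Domain(m1_m2)| + |m2 - Range(m1_m2)|."""
    dom = {x for (x, _) in m1_m2}
    rng = {y for (_, y) in m1_m2}
    return len(m1_m2) + len(set(m1) - dom) + len(set(m2) - rng)

# c is unmatched in m1 and 2 is unmatched in m2: 2 + 1 + 1 instances.
print(merge_card({"a", "b", "c"}, {1, 2}, {("a", 1), ("b", 1)}))  # 4

# Total and surjective mapping: the result has exactly |m1_m2| instances.
print(merge_card({"a", "b"}, {1, 2}, {("a", 1), ("b", 2)}))       # 2
```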

Proof: The sets $S_1, S_2, S_3$ of Lemma B.2.1 constitute a disjoint
decomposition of $m$. Thus, each model $m$ with (i)-(iv) has exactly
$k = |S_1| + |S_2| + |S_3|$ instances. This implies that any other model
with (i)-(iii) has at least $k$ instances. Thus, given a model $m'$
with (i)-(iii) that has $k$ instances we can conclude that it is minimal
and so satisfies (iv). By Lemma B.2.1, $k = |S_1| + |S_2| + |S_3| =
|m_1\_m_2| + |m - \mathrm{Domain}(m\_m_1)| + |m - \mathrm{Domain}(m\_m_2)| = |m_1\_m_2| +
|m_1 - \mathrm{Domain}(m_1\_m_2)| + |m_2 - \mathrm{Range}(m_1\_m_2)| = \mathrm{mergeCard}(m_1, m_2, m_1\_m_2)$.
If $m_1\_m_2$ is total and surjective, or if $m\_m_1$ and $m\_m_2$ are total, then
$S_1 = S_2 = \emptyset$ and hence $k = |m_1\_m_2|$. ■
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B.3 Diff Operator 

Lemma B.3.1. Let $\mathrm{Domain}(m\_m') \subseteq m$. $(m_d, m\_m_d) = \mathrm{Diff}(m, m\_m')$
holds if and only if the following conditions are satisfied:

1. $m\_m_d$ is a surjective function from $m$ onto $m_d$.

2. For all $y_1, y_2 \in \mathrm{Domain}(m\_m')$ with $y_1 \neq y_2$ and $\mathit{ind}(y_1, y_2, m\_m')$ there
exist $(y_1, d_1), (y_2, d_2) \in m\_m_d$ with $d_1 \neq d_2$.

3. If $y \in m - \mathrm{Domain}(m\_m')$, then there exists $(y, d) \in m\_m_d$ and
$\{y' \mid (y', d) \in m\_m_d\} = \{y\}$.

4. $m_d$ is a minimal model with (1)-(3). ■

Condition (2) ensures that the instances of $m$ that are indistinguishable in
$m\_m'$ become distinguishable in $m\_m_d$. Condition (3) requires each instance
of $m$ that does not participate in $m\_m'$ to have a counterpart in $m_d$ that is
not connected to any other instance of $m$. It ensures that Diff picks up the
instances of $m$ that get lost upon extraction. Condition (4) makes the result
of Diff minimal.
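Conditions (1)-(3) can be checked directly for a candidate result. The Python sketch below assumes set-of-pairs mappings and reads condition (1) as "each instance has at most one image, the domain lies in $m$, and the range is all of $m_d$"; whether the function must also be total on $m$ is left open here. The name `check_diff_conditions` and the example data are hypothetical.

```python
def images(y, mapping):
    return {z for (src, z) in mapping if src == y}

def check_diff_conditions(m, m_mprime, m_d, m_m_d):
    m, m_d = set(m), set(m_d)
    dom_prime = {y for (y, _) in m_mprime}
    # (1) m_m_d is a function with domain in m and range equal to m_d.
    cond1 = (all(len(images(y, m_m_d)) == 1 for (y, _) in m_m_d)
             and {y for (y, _) in m_m_d} <= m
             and {d for (_, d) in m_m_d} == m_d)
    # (2) distinct indistinguishable instances get distinct images in m_d.
    cond2 = all(
        bool(images(y1, m_m_d)) and bool(images(y2, m_m_d))
        and images(y1, m_m_d).isdisjoint(images(y2, m_m_d))
        for y1 in dom_prime for y2 in dom_prime
        if y1 != y2 and images(y1, m_mprime) == images(y2, m_mprime))
    # (3) instances outside Domain(m_m') get a private counterpart in m_d.
    cond3 = all(
        any({y2 for (y2, d2) in m_m_d if d2 == d} == {y}
            for d in images(y, m_m_d))
        for y in m - dom_prime)
    return cond1 and cond2 and cond3

m = {"a", "b", "c", "d"}
m_mprime = {("a", 1), ("a", 2), ("b", 1), ("b", 2), ("c", 2)}
good = {("a", "d1"), ("b", "d2"), ("c", "d1"), ("d", "d3")}
bad = {("a", "d1"), ("b", "d1"), ("c", "d1"), ("d", "d3")}

print(check_diff_conditions(m, m_mprime, {"d1", "d2", "d3"}, good))  # True
print(check_diff_conditions(m, m_mprime, {"d1", "d3"}, bad))         # False
```

The second candidate is rejected because a and b, which are indistinguishable under the input mapping, share the image d1.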

Proof: Conditions (iii) and (4) are identical. We show that conditions (i)-(ii)
of Definition 4.2.5 are equivalent with (1)-(3). First, we rewrite the conditions
(i)-(iii) by expanding the alternative definitions of Extract and Merge from
Theorem 4.2.1 and Theorem 4.2.3 and removing tautologies. We obtain the
following:

a. $m\_m_x$ and $m\_m_d$ are surjective functions onto $m_x$ and $m_d$, respectively.

b. $\mathrm{Domain}(m\_m_x) = \mathrm{Domain}(m\_m')$.

c. For all $(y_1, x_1), (y_2, x_2) \in m\_m_x$: $x_1 = x_2$ iff $\mathit{ind}(y_1, y_2, m\_m')$.

d. $m = \mathrm{Domain}(m\_m_x) \cup \mathrm{Domain}(m\_m_d)$.

e. $m_x\_m_d = \mathrm{Invert}(m\_m_x) \circ m\_m_d$.

f. The statements below hold for pairwise disjoint partitions $S_1, S_2, S_3$ of
$m$, $S_1 \cup S_2 \cup S_3 = m$:

- $S_1 = m - \mathrm{Domain}(m\_m_d) = m_x - \mathrm{Domain}(m_x\_m_d)$,

- $S_2 = m - \mathrm{Domain}(m\_m_x) = m_d - \mathrm{Range}(m_x\_m_d)$,

- $S_3 = \mathrm{Domain}(m\_m_x) \cap \mathrm{Domain}(m\_m_d) = m_x\_m_d$.

($\rightarrow$) We show that conditions (1)-(3) of Lemma B.3.1 follow from (a)-(f).
Let (a)-(f) hold. Condition (1) follows immediately from (a). Now, let
$y_1, y_2 \in \mathrm{Domain}(m\_m')$ with $y_1 \neq y_2$ and $\mathit{ind}(y_1, y_2, m\_m')$. By condition
(b), $y_1, y_2 \in \mathrm{Domain}(m\_m_x)$. Therefore, there exist $(y_1, x_1), (y_2, x_2) \in
m\_m_x$. Since $\mathit{ind}(y_1, y_2, m\_m')$, then due to condition (c), $x_1 = x_2$. That
is, there exist $(y_1, x), (y_2, x) \in m\_m_x$. Assume that $y_1 \notin \mathrm{Domain}(m\_m_d)$,
i.e., $y_1 \in m - \mathrm{Domain}(m\_m_d)$. Then, by Theorem 4.2.3, $y_1$ is determined
uniquely by $x$, a contradiction to $y_1 \neq y_2$. Thus, $y_1 \in \mathrm{Domain}(m\_m_d)$.
Analogously, $y_2 \in \mathrm{Domain}(m\_m_d)$. Let $(y_1, d_1), (y_2, d_2) \in m\_m_d$. Assume
that $d_1 = d_2 = d$. Then, by Theorem 4.2.3, $(x, d)$ determine $y$ uniquely,
a contradiction to $y_1 \neq y_2$. Hence, $d_1 \neq d_2$ and condition (2) holds.

To show condition (3), let $y \in m - \mathrm{Domain}(m\_m')$. Then, by condition
(b), $y \in m - \mathrm{Domain}(m\_m_x)$. By Theorem 4.2.3, there exists a uniquely
determined $d \in m_d - \mathrm{Range}(m_x\_m_d)$ such that $y$ is the only instance of
$m$ with $(y, d) \in m\_m_d$.

($\leftarrow$) We prove that conditions (1)-(3) of Lemma B.3.1 imply (a)-(f). Let
(1)-(3) hold. Conditions (a)-(c), which come from Extract, determine $m_x$
and $m\_m_x$ uniquely up to isomorphism. Let $m_x$ and $m\_m_x$ be fixed with
(a)-(c). Condition (1) also states that $m\_m_d$ is a surjective function onto
$m_d$, thus (a) is satisfied.

We show condition (d): $m = \mathrm{Domain}(m\_m_x) \cup \mathrm{Domain}(m\_m_d) =
\mathrm{Domain}(m\_m') \cup \mathrm{Domain}(m\_m_d)$. Let $y \in m$. If $y \in \mathrm{Domain}(m\_m')$,
then trivially $y \in \mathrm{Domain}(m\_m') \cup \mathrm{Domain}(m\_m_d)$. Let $y \in m -
\mathrm{Domain}(m\_m')$. By condition (3), there exists $(y, d) \in m\_m_d$, therefore
$y \in \mathrm{Domain}(m\_m_d)$. By the assumption of Lemma B.3.1,
$\mathrm{Domain}(m\_m') \subseteq m$. By condition (1), $\mathrm{Domain}(m\_m_d) \subseteq m$. Therefore,
$\mathrm{Domain}(m\_m_d) \cup \mathrm{Domain}(m\_m') \subseteq m$, and the equality (d)
follows.

Finally, we show condition (f), which can be stated as

- $S_1 = m - \mathrm{Domain}(m\_m_d) = m_x - \mathrm{Domain}(\mathrm{Invert}(m\_m_x) \circ m\_m_d)$,

- $S_2 = m - \mathrm{Domain}(m\_m_x) = m_d - \mathrm{Range}(\mathrm{Invert}(m\_m_x) \circ m\_m_d)$,

- $S_3 = \mathrm{Domain}(m\_m_x) \cap \mathrm{Domain}(m\_m_d) = \mathrm{Invert}(m\_m_x) \circ m\_m_d$.

$S_1$: Let $y \in m - \mathrm{Domain}(m\_m_d)$. Then, there are two possibilities: either
$y \in m - \mathrm{Domain}(m\_m_x)$ or $y \notin m - \mathrm{Domain}(m\_m_x)$. In the former
case, we obtain $y \in m - \mathrm{Domain}(m\_m_x) = S_2$. This is a contradiction,
since $S_1$ and $S_2$ are disjoint. Thus, $y \notin m - \mathrm{Domain}(m\_m_x)$. In other
words, $y \in \mathrm{Domain}(m\_m_x)$. Hence, there exists $(y, x) \in m\_m_x$. Since
$m\_m_x$ is a function, $x$ is determined uniquely by $y$. By (1), $x \in m_x$.
Assume that $x \in \mathrm{Domain}(\mathrm{Invert}(m\_m_x) \circ m\_m_d)$. That is, there must
exist $(y, d) \in m\_m_d$ for the composition not to drop $x$. Consequently,
$y \in \mathrm{Domain}(m\_m_d)$. This is a contradiction to our assumption $y \in m -
\mathrm{Domain}(m\_m_d)$. Therefore, $x \in m_x - \mathrm{Domain}(\mathrm{Invert}(m\_m_x) \circ m\_m_d)$.
Now, let $x \in m_x - \mathrm{Domain}(\mathrm{Invert}(m\_m_x) \circ m\_m_d)$. That is, $x \notin
\mathrm{Domain}(\mathrm{Invert}(m\_m_x) \circ m\_m_d)$. By (1), $x \in \mathrm{Range}(m\_m_x)$. Let $(z, x) \in
m\_m_x$. By (d), $z \in m$. For the composition to fail, $z \notin \mathrm{Domain}(m\_m_d)$.
Thus, $z \in m - \mathrm{Domain}(m\_m_d)$. Assume that we have two such instances
$z_1 \neq z_2$ with $(z_1, x), (z_2, x) \in m\_m_x$. By (b), $z \in \mathrm{Domain}(m\_m_x) =
\mathrm{Domain}(m\_m')$. Hence, by condition (c), $\mathit{ind}(z_1, z_2, m\_m')$ holds. But
then, condition (2) applies and there exist $(z_1, d_1), (z_2, d_2) \in m\_m_d$
with $d_1 \neq d_2$. However, $z_1, z_2 \notin \mathrm{Domain}(m\_m_d)$. Thus, we obtained a
contradiction and $z$ is determined uniquely.

$S_2$: Let $y \in m - \mathrm{Domain}(m\_m_x)$. By (b), $y \in m - \mathrm{Domain}(m\_m')$. Then,
by condition (3) there exists a unique $d \in \mathrm{Range}(m\_m_d)$ with $(y, d) \in
m\_m_d$. Assume that $d \in \mathrm{Range}(\mathrm{Invert}(m\_m_x) \circ m\_m_d)$. Since $m\_m_d$ is a
function, $d \in \mathrm{Range}(m\_m_d)$ is determined uniquely by $y$. Therefore,
there must exist $(y, x) \in m\_m_x$ for the composition to produce $d$.
Thus, $y \in \mathrm{Domain}(m\_m_x)$. This contradicts our assumption, therefore,
$d \in m_d - \mathrm{Range}(\mathrm{Invert}(m\_m_x) \circ m\_m_d)$.

Now, let $d \in m_d - \mathrm{Range}(\mathrm{Invert}(m\_m_x) \circ m\_m_d)$. Since $m\_m_d$ is
surjective, $d \in \mathrm{Range}(m\_m_d)$. Thus, there exists $(z, d) \in m\_m_d$ with
$z \notin \mathrm{Domain}(m\_m_x)$. By (b), $z \in m - \mathrm{Domain}(m\_m')$. Assume that
we have two such instances, $z_1 \neq z_2$. Since $m\_m_d$ is a function, they
both map to $d$. This is a contradiction to (3). Thus, $z$ is determined
uniquely.

$S_3$: Let $z \in \mathrm{Domain}(m\_m_x) \cap \mathrm{Domain}(m\_m_d)$. Then, trivially, there
exist $(z, x) \in m\_m_x$ and $(z, d) \in m\_m_d$. By (a), $m\_m_x$ and $m\_m_d$
are functions. Therefore, $(x, d) \in \mathrm{Invert}(m\_m_x) \circ m\_m_d$ is determined
uniquely. Now, let $(x, d) \in \mathrm{Invert}(m\_m_x) \circ m\_m_d$. Thus, there
exists $z \in \mathrm{Domain}(m\_m_x) \cap \mathrm{Domain}(m\_m_d)$ with $(z, x) \in m\_m_x$ and
$(z, d) \in m\_m_d$. Assume that $(z', x) \in m\_m_x$ and $(z', d) \in m\_m_d$ for
some $z' \neq z$. Then, we obtain a contradiction to (1). Hence, instance $z$ is
determined uniquely. ■

Although Lemma B.3.1 simplifies Definition 4.2.5 substantially, the pre- 
sence of the minimality condition is still unsatisfactory. The following theorem 
substitutes the minimality condition by a precise lower bound and helps us 
to argue the correctness of the subsequent examples. The construction used 
in the proof of the theorem can be exploited to find a valid solution for Diff 
for concrete schema and mapping languages. 

Theorem 4.2.5 (from page 78). Let $\mathrm{Domain}(m\_m') \subseteq m$. $(m_d, m\_m_d) =
\mathrm{Diff}(m, m\_m')$ holds if and only if conditions (1)-(3) of Lemma B.3.1 are
satisfied and $|m_d| = \mathrm{diffCard}(m, m\_m')$, where $\mathrm{diffCard}(m, m\_m') =_{df}
\max\{|c| : c \in \Pi \cup \{\emptyset\}, |c| \neq 1\} + |m - \mathrm{Domain}(m\_m')|$ and $\Pi$ is a
partitioning of $\mathrm{Domain}(m\_m')$ by $\mathit{ind}(\cdot, \cdot, m\_m')$. If $m\_m'$ is total,
$\mathrm{diffCard}(m, m\_m') = \max\{|c| : c \in \Pi \cup \{\emptyset\}, |c| \neq 1\}$.
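To make the formula concrete, the following Python sketch computes diffCard for finite inputs, again assuming mappings are sets of pairs. Classes of size 1 are skipped and the empty set contributes the value 0, as in the definition; the function name `diff_card` is an illustrative choice.

```python
def diff_card(m, m_mprime):
    """max{|c| : c in Pi or c empty, |c| != 1} + |m - Domain(m_m')|."""
    domain = {y for (y, _) in m_mprime}
    classes = {}   # image set -> equivalence class of ind(., ., m_m')
    for y in domain:
        key = frozenset(z for (src, z) in m_mprime if src == y)
        classes.setdefault(key, set()).add(y)
    sizes = [len(c) for c in classes.values() if len(c) != 1]
    return max(sizes + [0]) + len(set(m) - domain)

m_mprime = {("a", 1), ("a", 2), ("b", 1), ("b", 2), ("c", 2)}

# Classes {a, b} and {c}; d is not connected: 2 + 1 instances.
print(diff_card({"a", "b", "c", "d"}, m_mprime))        # 3

# Total mapping, all instances distinguishable: nothing to record.
print(diff_card({"a", "c"}, {("a", 1), ("c", 2)}))      # 0
```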

Proof: Let $\Pi$ be a partitioning of $\mathrm{Domain}(m\_m')$ by $\mathit{ind}(\cdot, \cdot, m\_m')$. By
condition (3) of Lemma B.3.1, $m_d$ contains a distinct instance for each
$y \in m - \mathrm{Domain}(m\_m')$. Moreover, $\{d \mid (y, d) \in m\_m_d \text{ and } y \in m -
\mathrm{Domain}(m\_m')\}$ is disjoint with $\{d \mid (y, d) \in m\_m_d \text{ and } y \in \bigcup \Pi\}$. Let
$c_{max}$ be a maximal equivalence class of $\Pi$. Condition (2) requires $m_d$ to
have a distinct instance for each $y$ within one equivalence class of $\Pi$, in
particular for each $y \in c_{max}$. If $|c_{max}| = 1$, then there are no two distinct
indistinguishable instances in $m$, and condition (2) is satisfied trivially.
Otherwise, $|m_d| \geq |c_{max}| + |m - \mathrm{Domain}(m\_m')|$. Taking into account the case
when $|c_{max}| = 1$, we obtain $|m_d| \geq \max\{|c| : c \in \Pi \cup \{\emptyset\}, |c| \neq 1\} +
|m - \mathrm{Domain}(m\_m')| = k$.

Next we prove by construction that there always exists $m\_m_d$ with $|m_d| =
k$ that satisfies (1)-(3). That will allow us to conclude that each $m_d$ must
have the cardinality of exactly $k$ due to condition (4). We construct $m\_m_d$
as follows. Notice that for each equivalence class $c = \{y_1, \ldots, y_p\} \in \Pi$, there
exists a total injective function $f_c : c \rightarrow c_{max}$, since $c_{max}$ is maximal. Let $f$
be a mapping defined as $f = \bigcup\{f_c : c \in \Pi\}$. $f$ is a surjective function onto
$c_{max}$. Let $g$ be a bijection from $m - \mathrm{Domain}(m\_m')$ onto some set $S$, such that
$S \cap \mathrm{Range}(f) = \emptyset$. Now, let $m\_m_d = f \cup g$ and $m_d = \mathrm{Range}(m\_m_d) = c_{max} \cup S$.
By construction, $m\_m_d$ is a surjective function with $|m_d| = k$ that satisfies
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